Preface


This manual contains recovery procedures for complex systems environments. The customer may wish 
to tailor the document to the requirements of his installation. 


This document provides the basis for a quick reference handbook for IBM ES/3090 operators. The emphasis
is on MVS/ESA recovery facilities and I/O recovery procedures. This edition of IBM ES/3090
Complex Systems Recovery and Availability System Recovery Procedures is based on a publication that
was previously published as IBM 3090 Models 400E and 600E System Recovery Procedures. Processor
Complex recovery and recovery planning, a reconfiguration checklist, and Processor Complex recovery
procedures can be found in IBM ES/3090 Complex Systems Recovery and Availability Reconfiguration and
Recovery Procedures.


Since data processing installations vary widely, this book cannot include all procedures for all IBM ES/3090 
installations, or, indeed, all procedures for any installation. 


Currency 


One difficult aspect of any documentation dealing with a product subject to frequent changes is keeping
the document current. Furthermore, there are several levels of software and hardware installed at any
one time, so there can be different correct procedures for certain situations. This document attempts to
reflect the state of the products, both hardware and software, at the time of the final draft of this document 
(January 1989). Additional documentation is included about features that have been announced but are 
not necessarily available at this time in all installations. 


All procedures described in this bulletin were tested on an IBM 3090 Model 200S at system EC level 223770
with MVS/ESA Version 3 Release 1.0e (MVS/ESA 3.1.0e) installed, with several APARs that were available 
at the time when the system was built. 


Organization 


This manual is divided into 19 chapters, each of which addresses a different aspect of system recovery. 


Chapter One provides a reference table that lists symptoms and refers to the chapters in this book where 
the symptom is discussed. 


Chapter Two describes disabled wait states and gives directions about the actions to be taken for each 
of them. 


Chapter Three offers diagnostic procedures for wait state problems. The contents of this chapter
are mainly referred to from the previous chapter and describe the use of the ALTCP frame in several
situations.


Chapter Four gives an overview of appropriate actions if the system appears to be in an enabled wait state. 


Chapter Five offers procedures for handling problems that are common at IPL time and instructs the user 
on how to overcome these situations. 


Chapter Six offers procedures for handling disabled and enabled loops and instructs the user on how to 
perform a proper instruction trace. 


Chapter Seven describes what to do when one or more of the MVS consoles are lost. 


Chapter Eight introduces DCCF messages, how to recognize them, and how to respond to them. 




Chapter Nine introduces the concept of device boxing. 
Chapter Ten describes how to recognize and how to react to missing interrupts. 


Chapter Eleven gives procedures for handling DPS devices, and shows the user how to recognize and re- 
synchronize DPS array out-of-sync situations. 


Chapter Twelve offers procedures for recognizing and removing write inhibit conditions. 


Chapter Thirteen describes 3990 SIM messages and gives procedures for unfencing all types of fencing 
conditions. 


Chapter Fourteen describes how to decide whether an unconditional reserve channel command is to be
issued by the system, especially when reserved devices are involved. 


Chapter Fifteen contains procedures for recognizing hot I/O and handling hot I/O wait states.
Chapter Sixteen has procedures for handling channel path recovery situations. 
Chapter Seventeen describes the recognition and handling of I/O hang conditions. 


Chapter Eighteen describes the recognition and handling of GRS ring reconfiguration, recovery, and
restart.


Chapter Nineteen describes how to react to configuration and malfunction alerts. 


How to Tailor this Book 


This manual can be obtained from ITSC Poughkeepsie in machine-readable form to allow you to tailor it 
to your needs or to integrate it into your existing operator documentation. Either send the Reader’s 
Comment Form from the back of this publication or send a request, through your IBM representative, to 
userid ITSCMAN at WTSCPOK.


The book was created using the starter set of the Generalized Markup Language (GML) of Release 3 of the 
IBM Document Composition Facility Program Product (program number 5748-XX9) and can be formatted 
and printed by anyone having access to that product. 


We suggest you delete the procedures and paragraphs that are not needed for your installation and add 
your own recovery procedures, as appropriate. 


Related Publications 
The following publications provide additional information that may assist the presenter or student to obtain 
a fuller understanding of the subject area: 

IBM ES/3090 Processor Complex: Recovery Guide, SC38-0070 

MVS/ESA Planning: Recovery and Reconfiguration, GC28-1837 


IBM ES/3090: Complex Systems Recovery and Availability Configuration Considerations, Volume I:
GG24-3340, Volume II: GG24-3341, and Volume III: GG24-3342


IBM ES/3090: Complex Systems Recovery and Availability, Volume I: GG24-3343, Volume II: GG24-3344,
and Volume III: GG24-3345


IBM ES/3090: Complex Systems Recovery and Availability Reconfiguration and Recovery Procedures, 
GG24-3347 


IBM ES/3090: Complex Systems Recovery and Availability Technical Guide, GG24-3348

IBM ES/3090: Complex Systems Recovery and Availability Exercise Guide, GG24-3349


IBM ES/3090: Complex Systems Recovery and Availability Exercise Installation and Run-Time Proce- 
dures, GG24-3350 


IBM 3090 Processor Complex: Recovery Concepts, GG24-3077 


IBM 3090 Processor Complex: System Recovery in a Complex Environment, Volume I: GG24-3195,
Volume II: GG24-3196, and Volume III: GG24-3197


Storage Subsystem Library: IBM 3990 Storage Control Introduction, GA32-0098 
Storage Subsystem Library: IBM 3990 Storage Control Reference, GA32-0099 


Storage Subsystem Library: IBM 3990 Storage Control Planning, Installation, and Administration 
Guide, GA32-0100 


System/370 Extended Architecture Reference Summary, GX20-0157 
MVS/ESA Diagnosis: Data Areas, Volume 5, LY28-1047 

MVS/Extended Architecture: System Messages, Volume 2, GC28-1377. 
MVS/ESA Message Library: System Messages, Volume 1, GC28-1812 
MVS/ESA Message Library: System Messages, Volume 2, GC28-1813 
MVS/ESA Message Library: System Codes, GC28-1815 

MVS/ESA System Programming Library: Initialization and Tuning: GC28-1828 


IBM Enterprise Systems Architecture/370: Principles of Operation, SA22-7200. 


Trademarks 


The following trademarks of the IBM Corporation are mentioned in this manual:


1.0 Using this Book


Each chapter in this book deals with a separate aspect of system recovery. The way the information in a 
chapter is presented depends on the particular problems discussed in the chapter. Note, however, that 
the information is presented in the way that will make it easiest for you to use and tailor. For example, 
most procedures are presented on a single page, or on facing pages. Further, when a particular diagram 
is required for several procedures (as in Chapter 8) the diagram is repeated for each procedure. This 
arrangement provides you with all the needed information at the point you need it, and relieves you of the 
necessity for looking in two places to get a complete procedure. 


Within each chapter, the information is presented in a sequence that is either alphameric (such as diag- 
nostic procedures in procedure number order) or logical, with similar problems grouped together. 


Figure 1 through Figure 3 are reference tables that direct you to the appropriate part in this book. Find
the problem under the heading "Action or Symptom" and refer to the section listed under "Task to
Perform".


Figure 4 through Figure 6 are reference tables that direct you to the actions to be taken if the listed
message appears. Find the message number under the heading "Message" and refer to the section listed
under "Task to Perform".


ACTION OR SYMPTOM       | TASK TO PERFORM
------------------------+------------------------------------------------------------
                        | "Device Boxing" on page 65
                        | "Cancel Action" on page 62
Cause codes             | "DPS Device Messages - Problem Cross Reference" on page 82
Channel path recovery   | "Channel Path Recovery" on page 113
CHPID alert             | "Alerts" on page 125
DASD ERP                | "Recognizing a Write Inhibit Condition" on page 87
                        | "Disabled Console Communication Facility" on page 59
DCCF messages           | "Responding to Messages Issued Through DCCF" on page 61
DCCF responses          | "Responding to Messages Issued Through DCCF" on page 61


Figure 1. Cross Reference Table (Part 1 of 3)


ACTION OR SYMPTOM           | TASK TO PERFORM
----------------------------+------------------------------------------------------------
Device boxed                | "Device Boxing" on page 65
Diagnostic procedures       | "Wait State Diagnostic Procedures" on page 19
Disabled loop               | "Disabled Loop" on page 46
Disabled wait               | "Disabled Wait States" on page 7
Display device number       | "PD01" on page 20
  from subchannel number    |
Display device number       | "PD02" on page 21
  from UCB address          |
DPS devices                 | "Dynamic Pathing" on page 79
DPS device handling         | "Handling Dynamic Pathing Devices" on page 79
DPS operational procedures  | "Handling Dynamic Pathing Devices" on page 79
DPS cross reference         | "DPS Device Messages - Problem Cross Reference" on page 82
DPS array out-of-sync       | "Recognizing DPS Array Out-of-Sync" on page 80
DPS array resync.           | "Recovering a 3990/3380 Out of Sync Condition" on page 81
Enabled loop                | "Enabled Loop" on page 43
Enabled wait                | "Enabled Wait States" on page 33
GRS reconfiguration         | "GRS Ring Reconfiguration" on page 119
GRS recovery                | "Recovering from GRS Ring Disruption" on page 121
Hot I/O                     | "Hot I/O" on page 101
Hot I/O messages            | "Hot I/O Message - IOS111A" on page 105
Hot I/O recognition         | "Recognizing Hot I/O" on page 101
Hot I/O wait states         | "Handling Hot I/O Wait States" on page 107
Instruction trace           | "Instruction Trace" on page 51
I/O hang                    | "I/O Hang Conditions" on page 117
IML control unit            | "Handling Dynamic Pathing Devices" on page 79


[icp 


Figure 2. Cross Reference Table (Part 2 of 3) 


"Loops on page 43 
ACTION OR SYMPTOM           | TASK TO PERFORM
----------------------------+------------------------------------------------------------
Loop                        | "Loops" on page 43
Loop recording              | "Instruction Trace" on page 51
Missing interrupt (MIH)     | "Missing Interrupts" on page 69
Missing interrupt messages  | "Missing Interrupt Handler Messages" on page 69
Page data set               | "Page Data Set Volume Error" on page 99
Resynchronize DPS arrays    | "Recognizing DPS Array Out-of-Sync" on page 80
Spin loop                   | "Spin Loop" on page 47
Spin loop handling          | "Spin Loop" on page 47
Spin loop responses         | "Handling Wait State 09X" on page 50
Spin loop time out          | "Spin Loop" on page 47
Start pending               | "Handling Message IOS071I" on page 70
Unconditional reserve U/R   | "Unconditional Reserve" on page 97
Wait state                  | "Disabled Wait States" on page 7
Wait state codes            | "Disabled Wait States" on page 7
Write inhibit               | "Write Inhibit" on page 87
Write inhibit handling      | "Handling Write Inhibit Condition" on page 87
Write inhibit messages      | "Messages Issued During Write Inhibit Processing" on page 87
Write inhibit removal       | "Removing a Write Inhibit Condition" on page 88


Figure 3. Cross Reference Table (Part 3 of 3)


MESSAGE                          | TASK TO PERFORM
---------------------------------+--------------------------------------------
IEA466E PATH PERMANENT           | Inhibit"
  I/O ERROR                      |
IEA467E PATH WRITE               | Inhibit"
  INHIBITED FOR WRITES           |
IEA468E WRITE INHIBITED          | Inhibit"
  PATH ENCOUNTERED               |
IEA469E PATH HAS BEEN            | Inhibit"
  VARIED OFFLINE                 |
IEA469E PATH CANNOT BE           |
  VARIED OFFLINE                 |
IEF281I ddd NOW OFFLINE          | Device Boxing
  DEVICE IS BOXED                |
IOS004I IOS RECOVERY             | "Handling Message IOS004I" on page 116
  FAILURE                        |
IOS062E ERROR ON CHANNEL         | "Handling Message IOS062E" on page 113
  PATH-STOP SHARING SYSTEMS      |
IOS070E ddd,MOUNT PENDING        | "Handling Message IOS070E" on page 69
IOS075E                          | "Handling Message IOS075E" on page 71
IOS076E                          | "Handling Message IOS076E" on page 71
IOS077E                          | "Handling Message IOS077E" on page 71
IOS100I DEVICE ddd BOXED         | "Device Boxing" on page 65
IOS101I DEVICE ddd BOXED         | "Device Boxing" on page 65
IOS102I DEVICE ddd BOXED         | "Device Boxing" on page 65
IOS102I DEVICE ddd BOXED         | "Hot I/O" on page 101
  OPERATOR REQUEST               |
IOS104I DEVICE ddd BOXED         | "Device Boxing" on page 65
  UNCOND RESERVE FAILED          |
IOS105I DEVICE ddd BOXED         | "Device Boxing" on page 65
  BY UNCOND RESERVE PROCESS      |


Figure 4. Message Cross Reference Table (Part 1 of 3)
MESSAGE                          | TASK TO PERFORM
---------------------------------+------------------------------------------------------
IOS106E VARY ddd OFFLINE         | "Device Boxing" on page 65
  TO JES3                        |
IOS109E HOT I/O RECOVERY         | "Hot I/O" on page 101
  INITIATED FOR DEV ddd          |
IOS109E HOT I/O RECOVERY         | "Hot I/O" on page 101
  INITIATED FOR DEV ddd          |
IOS109E HOT I/O RECOVERY         | "Hot I/O" on page 101
  INITIATED FOR DEV ddd          |
IOS203I CHANNEL PATH yy          | "Hot I/O" on page 101
  SUCCESSFULLY RECOVERED         |
IOS110A IOS HAS DETECTED         | "Hot I/O Message - IOS110A" on page 103
  HOT I/O ON DEVICE ddd          |
IOS111A IOS HAS DETECTED         | "Hot I/O Message - IOS111A" on page 105
  HOT I/O ON DEVICE ddd          |
IOS112A IOS HAS DETECTED         | "Hot I/O Message - IOS112A" on page 106
  HOT I/O ON DEVICE ddd          |
IOS113W IOS RECOVERY FAILURE-    | "Handling Message IOS113W and Wait State 113"
  RESERVES MAY BE LOST           |   on page 116
IOS162A CHPID XX ALERT           | "Malfunction Alert - IOS162A" on page 126
IOS163A CHPID XX ALERT           | "Configuration Alert - IOS163A" on page 125
IOS202I CHANNEL PATH yy          | "Hot I/O" on page 101
  FORCED OFFLINE                 |
IOS202I CHANNEL PATH cc          | "Channel Path Recovery" on page 113
  FORCED OFFLINE                 |
IOS203I CHANNEL PATH cc          | "Channel Path Recovery" on page 113
  SUCCESSFULLY RECOVERED         |
IOS427A component FAILURE        | "Message IOS427A" on page 98
  REPLY WITH UR,BOX OR NOOP      |
IOS428I ddd,cc, RECOVERED        | "Message IOS428I" on page 99
  THROUGH CHANNEL PATH zz        |
IOS429I ddd NOT RECOVERED        | "Message IOS429I" on page 99
  THROUGH ALT CHANNEL PATH       |


Figure 5. Message Cross Reference Table (Part 2 of 3)
MESSAGE                          | TASK TO PERFORM
---------------------------------+------------------------------------------------------
IOS450E ddd,cc text              | "Recognizing DPS Array Out-of-Sync" on page 80
  PATH TAKEN OFFLINE             |
IOS451I ddd, BOXED, text         | "Recognizing DPS Array Out-of-Sync" on page 80
IOS452I ddd,cc text              | "Recognizing DPS Array Out-of-Sync" on page 80
ISG022E Disrupted GRS            | "Recovering from GRS Ring Disruption" on page 121
ISG023E GRS Disrupted            | "Recovering from GRS Ring Disruption" on page 121


Figure 6. Message Cross Reference Table (Part 3 of 3)




2.0 Disabled Wait States 


A disabled wait state is a wait state in which the system will accept no Machine Check, External, or I/O 
interrupts (it is disabled for interrupts). A disabled wait state is characterized by the following: 


• No input is accepted at the master console.


• A message is sent to the system console and the processor controller alarm is sounded when the PCE
  detects the loading of a disabled wait PSW.


• The disabled wait message can be re-displayed by pressing the VIEWLOG key.


The message has the form: 


  ******************************************************************
  * CPy has entered disabled wait.
  * PSW = 000A0000 00000nnn
  *
  * Intended Console: System
  *
  * Detailed Information: The processor has loaded a wait PSW
  *   which is disabled for all interrupts.
  *
  * System Action: None. The processor remains in the operating
  *   state but is not executing instructions.
  *
  * User Action: Refer to operating system message and wait codes
  *   publication for recommended action.
  ******************************************************************


The PSW has the form: 000A0000 00000nnn
                         p


In the PSW, the digit marked by 'p' indicates that the system is in a wait state (bit 14 is on), and
'nnn' is the wait state code.


Some disabled wait state PSWs contain extra information and have the form: 


PSW=000A0000 O00Oxx0nnn 
PP 


The digits marked by 'pp' may contain:

• A CP identifier, in the case of a spin loop Wait09X, for example.


• A reason code, in the case of a Wait055, for example.
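
The PSW layout described above can be decoded mechanically. The following Python fragment is a minimal sketch only, not part of any IBM product: the function names `psw_bits` and `decode_disabled_wait` are hypothetical, and it assumes the PSW is typed in as two 8-digit hexadecimal words with bit 0 as the leftmost bit, as displayed in the disabled wait message. It reports the wait bit (bit 14), the wait state code (the last three hexadecimal digits), and bits 40-47, where several wait states carry a reason code.

```python
def psw_bits(psw: int, first: int, last: int) -> int:
    """Return bits first..last (inclusive) of a 64-bit PSW; bit 0 is the leftmost bit."""
    width = last - first + 1
    return (psw >> (63 - last)) & ((1 << width) - 1)


def decode_disabled_wait(text: str) -> dict:
    """Decode a PSW written as two 8-digit hexadecimal words, e.g. '000A0000 00000019'."""
    digits = text.replace(" ", "")
    if len(digits) != 16:
        raise ValueError("expected 16 hexadecimal digits")
    psw = int(digits, 16)
    return {
        "wait": psw_bits(psw, 14, 14) == 1,                   # bit 14 on = wait state
        "wait_code": format(psw_bits(psw, 52, 63), "03X"),    # 'nnn', last three digits
        "bits_40_47": format(psw_bits(psw, 40, 47), "02X"),   # extra information, if any
    }


if __name__ == "__main__":
    print(decode_disabled_wait("000A0000 00000019"))
```

An installation tailoring this book might use such a helper to check a PSW copied from the system console before looking up the wait state code in the reference table.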




Disabled wait states are used: 


• To terminate MVS when the hardware or MVS detects an unrecoverable error (this is a non-restartable
  wait).


• To communicate to the operator a condition that requires operator action when normal communication
  through the MVS operator console is not possible (this is a restartable wait).


The Disabled Wait States Codes Reference Table, shown in Figure 7 through Figure 16, can be used as 
a quick reference chart for the operator to handle disabled wait states. The contents of this table are 
based on MVS/ESA System Codes. The table has been simplified and updated for the 3090 Processor 
Complex, but it should be used only by the operators; system programmers should refer to MVS/ESA 
System Codes for a full description. 


For situations where the operator action is more complicated and requires a sequence of operations
rather than a single one, a set of procedures is given in the pages following the reference table. The
procedures are named PDxx and are described under "Wait State Diagnostic Procedures" on page 19.
WAIT  |                         |
STATE | REASON                  | ACTION
------+-------------------------+------------------------------------------------
002   | HW failure during IPL   | Try again; if problem persists, call CE.
003   | HW failure during IPL   | Check for IPL device enabled or try to
      | (IPL device)            | IPL from a different device.
004   | HW failure during IPL   | Try again; if problem persists, call CE.
      |                         | Isolate the failing unit using procedure
      |                         | "PD01" on page 20.
005   | HW failure during IPL   | Make sure the IPL pack is ready
      | (unit failure)          | and re-IPL the system. If IPL continues to
      |                         | fail, try from your alternate IPL volume. See
      |                         | "PD01" on page 20.
006   | I/O error reading       | Call system programmer to check that the data
      | SYS1.NUCLEUS at IPL     | set SYS1.NUCLEUS is not in secondary extents.
      |                         | Also, refer to "PD09" on page 28.
007   | No console available    | Check consoles.
      | (NIP)                   | Check IOGEN or MVSCP definition of consoles.
      |                         | Refer to "PD11" on page 31.
00A   | No SYS1.LINKLIB in      | Call system programmer.
      | catalog                 |
00B   | Master Scheduler ABEND  | Try again and print the dump that was taken
      |                         | by the system. If second try is unsuccessful,
      |                         | take a SADUMP and call the system programmer.
00C   | User error              | Notify system programmer.
00D   | Master Scheduler ABEND  | Try again. If unsuccessful, take a SADUMP and
      |                         | notify the system programmer.
00E   | User error              | Check that the alternate nucleus selection
      |                         | was correct in the Load Parameter field. If
      |                         | problem persists, call the system programmer.
00F   | User error (No          | Correct the IPL address and try
      | IPL text on volume)     | again.
013   | System error            | Take a SADUMP and try to re-IPL.
014   | System error            | Take a SADUMP and try to re-IPL.
015   | Hardware problem (3092) | Re-IPL and notify hardware support.
      | during NIP              |
Figure 7. Disabled Wait State Codes Table (Part 1 of 10) 
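
An installation that keeps its tailored copy of this table in machine-readable form could look up actions by wait state code. The sketch below is illustrative only: the names `WAIT_ACTIONS` and `action_for` are hypothetical, and the dictionary holds just three rows copied from Figure 7, not the complete table.

```python
# Three sample rows taken from Figure 7; a real table would hold every code.
WAIT_ACTIONS = {
    "002": "Try again; if problem persists, call CE.",
    "00A": "Call system programmer.",
    "013": "Take a SADUMP and try to re-IPL.",
}


def action_for(code: str) -> str:
    """Return the operator action for a wait state code such as '2', '00a' or '013'."""
    key = code.strip().upper().zfill(3)  # normalize to three upper-case hex digits
    return WAIT_ACTIONS.get(key, "Code not in this excerpt; refer to MVS/ESA System Codes.")


if __name__ == "__main__":
    print(action_for("00a"))
```

Normalizing the key first means the lookup tolerates lower-case input and omitted leading zeros.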
WAIT |
STATE | REASON 


Hardware problem (3092) 
during NIP 


ACTION 


Re-IPL and notify Hardware support. 


Hardware problem during | Isolate the failing unit using 


| IPL (unit check) 


User or Hardware error 


Slip trap match 
| (restartable) 


System error 


| Hardware error 
(3092). 


| I/O error on 
console at IPL 


User or hardware error 
Duplexed page 

Primary and secondary 
devices not ready 


"PD01" on page 20 and re-IPL.


Often the result of pressing the START key. This causes the non-IPL online processor to
enter WAIT STATE 019. It is not necessary to re-IPL. Continue the IPL procedure and the
waiting processors will be started automatically by MVS/SP V2 and V3.


Notify the system programmer and follow his instructions or your predefined installation
procedure. Either restart the waiting CP or take a SADUMP and re-IPL the system.


Take a SADUMP and re-IPL.


Take a SADUMP and re-IPL.


Notify hardware support. Check the processor controller. Try to do a switch-over or IPL.
If this does not work, partition the machine and use the side with a good 3092.


Check the security key on the master console. Check the IOGEN or MVSCP definition of the
master console. Try to re-IPL. If unsuccessful, switch off the master console and IPL
using the alternate console. Also, refer to "PD11" on page 31.


Verify that the correct volume is mounted. Register 7 contains the UCB address of the
verified device. Use "PD02" on page 21 to check the VOLSER, and re-IPL.


Figure 8. Disabled Wait State Codes Table (Part 2 of 10)
WAIT  |                          |
STATE | REASON                   | ACTION
------+--------------------------+------------------------------------------------
023   | System error             | Take a SADUMP and re-IPL.
024   | System or hardware       | Take a SADUMP and re-IPL.
      | error                    | If unsuccessful, check that the Read/Write
      |                          | switch on all system DASD devices is set to
      |                          | R/W. Notify the system programmer.
025   | Duplicate Nucleus entry  | Notify system programmer
      | point during IPL         | and take a SADUMP.
028   | Invalid I/O              | Re-IPL with valid I/O configuration
      | configuration identifier | identifier in the second and third
      | specified                | digits in the Load Parameter field on the
      |                          | OPRCTL or SYSCTL frames of the system console.
029   | TOD clock in error (NIP) | Depress 'TOD ENABLE' and 'ALT' during IPL
      |                          | until first IPL message is received.
02D   | User error (Accessing    | Take a SADUMP and notify the system
      | MSS during NIP)          | programmer. Do not access any MSS volumes
      |                          | during NIP.
02E   | Hardware error on        | Notify the system programmer who will
      | paging data sets         | allocate new paging space. Run EREP.
030   | Abend during NIP         | Take a SADUMP and re-IPL. Notify the system
      |                          | programmer.
031   | User error, no UCB       | If possible, mount the IPL pack
      | for IPL device           | on a device you know to be SYSGENed and
      |                          | re-IPL. If not possible or still unsuccessful,
      |                          | take a SADUMP and notify the system
      |                          | programmer.
032   | User, module missing     | Record the complete PSW, take a SADUMP,
      | in NUCLEUS               | and notify the system programmer.
033   | I/O error during NIP     | Try to re-IPL. If unsuccessful, record the
      |                          | complete PSW and notify the system programmer.
035   | Entry point in Nucleus   | Notify the system programmer
      | not found during IPL     | and take a SADUMP.


Figure 9. Disabled Wait State Codes Table (Part 3 of 10)


WAIT | 
STATE | REASON ACTION 


User or hardware error | Record preceding messages, take a SADUMP, and 
notify the system programmer. 


Not enough real Check the configuration and 
storage to IPL notify the system programmer. 


User error (Volume Take a SADUMP and notify 
error at IPL) the system programmer. 


User error during CLPA | Record preceding messages and notify the 
| system programmer. 


Module not found in LPA| Record preceding messages and notify the 
system programmer. 


User, page space       | Increase space for page space
shortage               | and re-IPL. The message that precedes this
                       | wait code specifies which page space (COMMON
                       | or LPA) was too small.


Not enough real Check the configuration and 
storage for CSA notify the system programmer. Probably the 
CSA is too large. 


User error, page space | Notify system programmer and re-IPL after 
shortage during IPL increasing the available page space. 


User error (Invalid Notify the system programmer 
invocation of a NIP and take a SADUMP. 
| function) 


User or system error Record the full PSW, take a SADUMP, 
(ABEND during NIP) and notify the system programmer. 


| Machine check during The logical address of the CPU 

NIP is in bits 40-47 of the PSW. Re-IPL and, if 
error persists, configure the failing CPU 
offline. That will require a power-—on-reset. 
Report the problem to your hardware CE. 


User or system error Notify system programmer 
during IPL and take a SADUMP. 


| User or system error Notify system programmer 
during NIP and take a SADUMP. 


Figure 10. Disabled Wait State Codes Table (Part 4 of 10) 
WAIT  |                         |
STATE | REASON                  | ACTION
------+-------------------------+------------------------------------------------
04A   | TOD clock error         | Restart the target CP. Press the 'TOD ENABLE'
      |                         | and 'ALT' keys at the system console for
      |                         | several seconds and IPL will continue.
050   | Hardware error          | Take a SADUMP, re-IPL, and run EREP.
      | (multiple ACR)          |
051   | Software error          | Take a SADUMP, re-IPL, and run EREP.
      | during ACR              |
052   | Hardware error          | Take a SADUMP, re-IPL, and run EREP.
      | during ACR              |
053   | SQA has been exhausted  | Notify system programmer.
054   | Nucleus member error    | Reason code in bits 40-43 of PSW.
      |                         | Notify the system programmer.
055   | Module not found in     | Reason code in bits 40-43 of PSW.
      | SYS1.NUCLEUS            | Most usual reason codes are:
      |                         | RC=01 - DAT-off nucleus module not found.
      |                         |   Ensure Load Parameter first digit is
      |                         |   correct.
      |                         | RC=02 - DAT-on nucleus module not found.
      |                         |   Ensure second and third Load Parameter
      |                         |   digits are correct.
      |                         | RC=03 - IPL information table not found
      |                         |   (IOSIITXX).
      |                         | RC=04 - Module list table not found.
      |                         | Notify system programmer to provide correct
      |                         | member(s) in SYS1.NUCLEUS.
059   | User or system error    | Notify the system programmer
      | during IPL              | and take a SADUMP.
05C   | User or hardware error  | Take a SADUMP and re-IPL. If
      | during IPL              | problem persists, notify system programmer.
05D   | User or hardware error  | Take a SADUMP and re-IPL. If the problem
      | during NIP              | persists, notify the system programmer.
05E   | Hardware error          | Restore the master catalog to the proper
      | during NIP              | volume. Try to re-IPL. If problem persists,
      |                         | notify the system programmer.


Figure 11. Disabled Wait State Codes Table (Part 5 of 10)
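
The reason codes listed for wait state 055 sit in bits 40-43 of the PSW, that is, in the third hexadecimal digit of the second PSW word. The sketch below shows one way to pull that digit out and translate it; the names `REASONS_055` and `reason_055` are hypothetical, the texts are the four reason codes listed in the table, and the input format (two 8-digit hexadecimal words) is an assumption.

```python
# Reason codes for wait state 055 as listed in the wait state codes table.
REASONS_055 = {
    1: "DAT-off nucleus module not found",
    2: "DAT-on nucleus module not found",
    3: "IPL information table not found",
    4: "Module list table not found",
}


def reason_055(psw_text: str) -> str:
    """Translate the reason code of a wait state 055 PSW such as '000A0000 00200055'."""
    psw = int(psw_text.replace(" ", ""), 16)
    if psw & 0xFFF != 0x055:          # the last three hex digits are the wait state code
        raise ValueError("PSW does not show wait state 055")
    rc = (psw >> 20) & 0xF            # bits 40-43, counting bit 0 as the leftmost bit
    return REASONS_055.get(rc, f"reason code {rc:X} not listed")


if __name__ == "__main__":
    print(reason_055("000A0000 00200055"))
```

The same shift-and-mask approach applies to the other wait states whose reason code is in bits 40-43; only the lookup table changes.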




WAIT 
STATE REASON ACTION 


User error during NIP | Check SYSCATxx in NUCLEUS and re—IPL. 
System error Re-IPL with CLPA. 


ASM detected TOD clock Correct the TOD clock, 
error and re-IPL the system. 


Reserved device in STOP all sharing systems and restart 
channel path recovery the waiting CP. Completion of recovery is 
| signalled by message IOS201E or by wait state 
code 114. See "Handling Wait State 062" on 
page 114. 


User or system error | Take a SADUMP and notify the system programmer.


|System error during NIP | Check Read/Write switch on DASD devices (must 
be on). Record full PSW contents. Take a 
SADUMP and notify the system programmer. 


Restart during NIP Probably RESTART Key depressed accidentally 
instead of TOD key. Re-IPL. 


Machine Check during NIP | Refer to procedure "PD10" on page 30.


System or user error Take a SADUMP and notify system programmer. 


Hardware error | Remove the CHPID indicated in storage location 
(channel) X'414' using procedure "PD05" on page 24. 
Run EREP. 


Hardware error          | Use procedure "PD06" on page 25
(paging device)         | to recover. See also "Page Data Set Volume
                        | Error" on page 99.


| Not enough real storage | Check configuration to ensure enough on-line 
storage. Re-IPL with more real storage if 
possible; otherwise, notify system program- 
mer. 


Figure 12. Disabled Wait State Codes Table (Part 6 of 10) 
WAIT
STATE REASON ACTION 


071 


072 


073 


074 


075 


076 


077 


078 


079 


081 


083 


084 


085 


Figure 13. 


Not enough virtual 
storage 


User error 


Hardware error (missing 


interrupt during IPL) 


Error in IPL logic 


User error during IPL 


User error during IPL 
User error during IPL 


I/O error on master 
catalog 


I/O configuration 
incompatible with 
system code 


User error during IPL 


software error 


software error 


User error during IPL 


Notify the system programmer.


Notify the system programmer.


Bits 40-43 of the PSW contain the reason code.
If RC=01, the IPL program is waiting for an I/O interrupt (may be caused by a hardware
error, device or control unit, or a reserve on the SYSRES issued by a sharing system).
If RC=02, the IPL program is waiting for an external interrupt (may be from the service
processor).
In both cases, try to IPL again. Contact hardware support if unsuccessful.


Reason code is in bits 36-43 of the PSW. Take a SADUMP, notify the system programmer, and
run EREP.


Notify the system programmer. Record the full PSW. Reason code is in bits 36-43 of the PSW.
Take a SADUMP, notify the system programmer.


Notify the system programmer.


Try to IPL again and notify the system programmer.


Re-IPL with different I/O configuration, matching the MVS/XA release.


Take a SADUMP, notify the system programmer.


Bits 40-47 of the PSW contain the reason code. Take a SADUMP and re-IPL the system.
Notify the system programmer.


Stop all processors. Bits 40-47 of the PSW contain the reason code. Take a SADUMP,
re-IPL the system, notify the system programmer, and run EREP.


Re-IPL with the CLPA option.


Disabled Wait State Codes Table (Part 7 of 10) 
REASON


| x=(1-E) spin loop 
| timeout 


Excessive spin loop 
condition 


Hardware error 
| (SIGP during IPL) 


| Hardware error 
| during reconfiguration 


| User error during IPL 


Hardware error
(processor controller)


| Software error during 
NIP 


| Common area shortage 
| (system) 


| Real storage shortage 
| for SQA | 


Software 


Hot I/O (Non DASD) 


Hot I/O (Non-res. DASD) 


| Take a SADUMP, 
EREP. 


ACTION | 


| Use the procedures described in 
"Spin Loop" on page 47 to handle the spin 


loop. 


All recovery actions have been exhausted.
Notify system programmer. Take a SADUMP and
re-IPL.


Try to re-IPL. If the error persists, try 


| to locate the failing CPU and to IML in a PP 
| configuration without the side that contains 
that CPU. 


See procedure "PD03" on page 22.


Notify the system programmer and 


| the hardware support personnel. 


Notify the system programmer. 


| Partition the machine and proceed 


on the good side. Call hardware support. 


| Bits 32-47 of the PSW contain 


the reason code. Take a SADUMP and notify the 


| system programmer. Re—-IPL. 


Take a SADUMP and notify the system
programmer. Re-IPL with a bigger SQA
specification.


Take a SADUMP and notify the 


| system programmer. Re-IPL. 


re-IPL the system, and run 


| See "Handling Hot I/O Wait States" on page 
| 107. Run EREP. 


See "Handling Hot I/O Wait States" on page 
107. Run EREP. 


Figure 14. Disabled Wait State Codes Table (Part 8 of 10) 
WAIT
STATE   REASON                    ACTION

112     Hot I/O (Res. DASD)       See "Handling Hot I/O Wait States" on page
                                  107. Run EREP.

113     Channel path recovery     Notify the system programmer of possible data
        error                     integrity exposure. Refer to "Handling Message
                                  IOS113W and Wait State 113" on page 116.

114     System recovered from     Refer to "Handling Wait State 114".
        channel path error

115     Page data set             See "PD04" on page 23.
        unavailable

116     MIH on page device        See "PD07" on page 26.
        detected during Restart

200     User error                Record the PSW, save the preceding messages
                                  and notify the system programmer.

201     System error              Record the PSW with reason code in bits 32-47.
                                  Restart the target CP in wait ('RESTART CPx').

202     System error              Record the reason code using "PD08" on page
                                  27. Report the error to the system programmer
                                  and restart.

A00     System error              Record preceding message IEA802W. Take a
                                  SADUMP. Re-IPL and run EREP.

A01     Hardware error            Re-IPL. Run EREP and notify
        MCH threshold reached     hardware support.

A18     User or hardware error    Check that all paging devices are ready and
                                  at the right address, and restart. If no
                                  obvious error is found, take a SADUMP and
                                  re-IPL.

A19     Hardware error            Run EREP and call hardware support.
        (channel subsystem lost)

A20     System error              Take a SADUMP. Notify the system programmer.
                                  Try to re-IPL without the FLPA parameter.

A21     System error              Take a SADUMP and notify the system programmer.
                                  Try to re-IPL without the MLPA or FLPA
                                  parameter. Record message IAR003W.

Figure 15. Disabled Wait State Codes Table (Part 9 of 10) 
WAIT
STATE   REASON                    ACTION

        Wait because another      The system resolves this wait state
        processor is in           automatically. No operator action
        recovery                  required.

        Error during MCH          Take a SADUMP, re-IPL, and run EREP.

        Loop during MCH           Take a SADUMP, re-IPL, and run EREP.

        Invalid machine check     Take a SADUMP, re-IPL, and run EREP.
        code

        Hardware error (one CP)   Processing continues on other processors.
                                  This wait state does not affect the other CPUs
                                  in the complex. Run EREP.

        Error during MCH          Take a SADUMP, re-IPL, and run EREP.

        Error during system       Take a SADUMP, re-IPL, and run EREP.
        termination

        User error                Take a SADUMP and notify the system programmer.

        Quiesce performed         Restart the waiting CP when you want to end
                                  quiesce.


Figure 16. Disabled Wait State Codes Table (Part 10 of 10)
3.0 Wait State Diagnostic Procedures


This chapter describes diagnostic procedures for wait state problems. 


Use of the ALTCP Frame


The diagnostic procedures may involve displaying storage, the PSW, or the general purpose registers to 
obtain information. The following should be noted: 


1. When operating in LPAR mode, ensure that the system console is displaying the partition you want to
   alter. Enter the service language command:

      SETLP lpname

   to display the partition you want to use.

2. It is not possible to use the Display function (ALTCP frame) if the CP is in the Load or disabled wait
   state. In this case, the PA frame and VIEWLOG key may be helpful, or it may be possible to display
   storage using another CP that is not in the Load or disabled wait state.

3. Early in an IPL, before the virtual storage address translation tables have been set up, use the
   following option to display storage:

      ‘A2 B2’ (Display Real Storage)

   When the IPL is complete, storage in the address range X’000’-X’FFF’ may be displayed using either
   option ‘A2 B2’ (Display Real Storage) or ‘A2 B3’ (Display Primary Virtual Storage), since the virtual
   and real addresses are the same.

4. If all processors are in a disabled wait state, you can display storage to obtain diagnostic information
   by invoking the OPRCTL frame and selecting option ‘O03’ (SYSRESET).

   Note: A system reset causes the status of the subchannels to be reset in the Channel Subsystem.
   An IPL is necessary afterwards. This may be useful in the case of a WAIT02E to help isolate the failing
   paging device.
PD01


THIS PROCEDURE PROVIDES THE DEVICE NUMBER FROM THE SUBCHANNEL NUMBER


The subchannel number is found at location X’B8’ (184) in virtual storage. This procedure should be used
for the following wait states:


WAIT004
WAIT005
IPL Enabled Wait (Steps 1-9)


Procedure


1. At the system console, invoke the ALTER/DISPLAY frame with:

   ‘F ALTCP’

2. When the ALTCP frame is displayed, specify the number of a CP that is not in the Load or disabled
   wait state.

3. Enter:

   ‘A2 B2’ (Display Real Storage)

4. Enter:

   ‘B8’ at ‘Address(hex) = >’

5. Note the data starting at storage location X’B8’. Record or print the first four bytes (‘yyyyyyyy’). Of
   those bytes, the last two (addresses X’BA’ and X’BB’) contain the subchannel number.

6. Invoke the IOPD frame by entering:

   ‘F IOPD’ or by selecting 08 on the INDEX0 frame.

7. When frame IOPD-00 is displayed, select ‘A5’ (Device Configuration).
8. Enter the subchannel number found above (at addresses X’BA’ and X’BB’).
9. Frame IOPD-50 is displayed (Device Configuration Display).

   This frame contains, for the selected subchannel number, the device number, the unit address, and
   the installed channel paths.

10. Determine the failing path to the device using the IOPD option A3. Observe the LPUM field to
    determine the last path used.

11. Use the service language commands to vary all but one path to the IPL volume offline (include the
    supposedly failing path).


12. Attempt to IPL again. 


13. It may be necessary to remove the device from the configuration. 
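
Step 5 of PD01 can be illustrated with a minimal sketch. It assumes the four bytes at X’B8’ have been copied from the frame as eight hexadecimal digits; the function name and the sample value are hypothetical and are not part of any IBM tool.

```python
import string


def subchannel_from_b8(word_hex):
    """Return the subchannel number (as hex digits) from the four bytes at X'B8'.

    The last two bytes (addresses X'BA' and X'BB') hold the subchannel number.
    """
    cleaned = word_hex.replace(" ", "").upper()
    if len(cleaned) != 8 or any(c not in string.hexdigits for c in cleaned):
        raise ValueError("expected four bytes written as 8 hexadecimal digits")
    return cleaned[4:]


# Hypothetical value copied from the ALTCP display.
print(subchannel_from_b8("0001001A"))  # prints 001A
```

The resulting value is what would be entered in step 8 on the IOPD frame.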
PD02


THIS PROCEDURE LOCATES THE DEVICE NUMBER FROM A GIVEN UCB ADDRESS
The UCB address is found in general register 7.
This procedure should be used for the following wait state:
WAIT022
Procedure


1. At the system console, invoke the ALTER/DISPLAY frame with:

   ‘F ALTCP’

2. When the ALTCP frame is displayed, specify the number of a CP that is not in the Load or disabled
   wait state.

3. Enter:

   ‘A2 B5’ (Display General Registers)

4. Print the frame or write down the contents of register 7, which contains the UCB address: X’aaaaaaaa’.
5. Enter:

   ‘A2 B3’ or just ‘B3’ to display primary virtual storage (since display mode is already in effect).

6. Enter:

   ‘aaaaaaaa’ (the UCB address obtained above) at ‘Address(hex) = >’

   This displays the actual UCB. At offset X’0D’ there are three bytes that contain the EBCDIC
   representation of the device number. At offset X’1C’ there are six bytes that contain the EBCDIC
   representation of the VOLSER that is supposed to be at that device number.


7. Verify that the correct volume is mounted and restart the system. 
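
The UCB fields named in step 6 can be decoded as in the following sketch. It assumes the UCB contents have been transcribed into a byte string, and it uses Python's `cp037` codec as a stand-in for the EBCDIC code page; the function name and the sample data are hypothetical.

```python
from typing import Tuple


def decode_ucb(ucb: bytes) -> Tuple[str, str]:
    """Return (device number, VOLSER) from a transcribed UCB.

    Offset X'0D': three EBCDIC bytes holding the device number.
    Offset X'1C': six EBCDIC bytes holding the expected VOLSER.
    """
    if len(ucb) < 0x1C + 6:
        raise ValueError("UCB image is too short to contain the VOLSER field")
    device = ucb[0x0D:0x0D + 3].decode("cp037")
    volser = ucb[0x1C:0x1C + 6].decode("cp037")
    return device, volser


# Build a hypothetical UCB image for demonstration only.
sample = bytearray(0x30)
sample[0x0D:0x10] = "1A0".encode("cp037")
sample[0x1C:0x22] = "SYSRES".encode("cp037")
print(decode_ucb(bytes(sample)))  # prints ('1A0', 'SYSRES')
```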
PD03


THIS PROCEDURE LOCATES THE FAILING CPU ADDRESS IN CASE OF SIGP FAILURE WAIT STATES 
This procedure should be used for the following wait state: 
WAIT0E0
Procedure 
1. At the system console, press the VIEWLOG key. 
2. Locate the message: 
‘OCOx CPy SIGP FAILED ............. : 
3. CPy is the failing CP. 
PD04


THIS PROCEDURE HELPS YOU LOCATE THE PAGING DEVICE NUMBER, AND TO RESTART THE SYSTEM 
IN SOME WAIT STATE SITUATIONS 


The address of the error information area is found at location X’40C’ in main storage. 
This procedure should be used for the following wait state: 

WAIT115 
Procedure 


1. At the system console, invoke the ALTER/DISPLAY frame with: 


‘F ALTCP’ 

2. When the ALTCP frame is displayed, specify the number of a CP that is not in the Load or disabled 
wait state. 

3. Enter: 


‘A2 B3’ (Display Primary Virtual Storage) 
4. Enter: 
‘40C’ at ‘Address(hex) = >’ 


5. Print the frame or write down the contents of location 40C (‘aaaaaaaa’), which is the address of the 
error information area. 


6. Enter: 
‘aaaaaaaa’ (the contents of location X’40C’ obtained above) at ‘Address(hex) = >’ 
This displays the contents of the error information area: 
• Offset X’4-7’ contains the wait state code.

• Offset X’10’ contains the reason code in hex:

  X’80’ = The pack mounted contains a different volume label from the one that
          was mounted at IPL time.
  X’40’ = Intervention required for the specified device.
  X’20’ = Device not operational.
  X’10’ = Permanent I/O error.

• Offset X’12-13’ contains the device number.


7. If the reason code is X’80’, verify that the proper pack is mounted and enter RESTART at the system
console. 


8. If the reason code is X’40’, ready the device and enter RESTART at the system console. 


9. If the reason code is X’10’ or X’20’, verify that the channel and the control unit are in normal state and 
enter RESTART at the system console. 
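
The layout of the error information area used in steps 6 through 9 can be summarized in a short sketch. It assumes the area has been transcribed into a byte string and that the device number at offset X’12-13’ is held in binary (the text does not state the encoding); all names are hypothetical.

```python
# Reason codes at offset X'10', paired with the operator action from steps 7-9.
REASONS = {
    0x80: ("Volume label differs from the one mounted at IPL time",
           "Verify that the proper pack is mounted, then RESTART"),
    0x40: ("Intervention required", "Ready the device, then RESTART"),
    0x20: ("Device not operational",
           "Verify channel and control unit state, then RESTART"),
    0x10: ("Permanent I/O error",
           "Verify channel and control unit state, then RESTART"),
}


def decode_error_area(area: bytes) -> dict:
    """Decode the WAIT115 error information area addressed by location X'40C'."""
    if len(area) < 0x14:
        raise ValueError("error information area image is too short")
    reason, action = REASONS.get(area[0x10], ("Unknown reason code", "Notify the system programmer"))
    return {
        "wait_state": area[0x04:0x08].hex().upper(),
        "reason_code": f"{area[0x10]:02X}",
        "reason": reason,
        "action": action,
        # Assumption: the two bytes are the device number in binary.
        "device": area[0x12:0x14].hex().upper(),
    }


# Hypothetical image for demonstration only.
image = bytearray(0x14)
image[0x04:0x08] = bytes.fromhex("00000115")
image[0x10] = 0x40
image[0x12:0x14] = bytes.fromhex("01A0")
print(decode_error_area(bytes(image)))
```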
PD05


THIS PROCEDURE HELPS YOU LOCATE THE CHANNEL PATH IN ERROR, REMOVE THAT CHANNEL PATH, 
AND RESTART THE SYSTEM IN SOME WAIT STATE SITUATIONS 


The CHPID in error is found at location X’414’ in main storage. 

This procedure should be used for the following wait state: 
WAIT06C

Procedure 


1. At the system console, invoke the ALTER/DISPLAY frame with: 


‘F ALTCP’ 

2. When the ALTCP frame is displayed, specify the number of a CP that is not in the Load or disabled 
wait state. 

3. Enter: 


‘A2 B3’ (Display Primary Virtual Storage) 
4. Enter: 
‘414’ at ‘Address(hex) = >’ 
to get the id of the channel path in error. 
5. Display the channel configuration frame by entering:
‘F CHNCFA’ or by selecting option 04 from the INDEX0 frame.
6. Enter:
‘CHPID xx OFF’
where ‘xx’ is the CHPID in error.
7. Enter:
‘RESTART CPx’
where CPx is the CP in wait state.
8. At the MVS operator console, enter:
‘CF CHP(xx),OFFLINE,UNCOND’
for the defective CHPID xx.
PD06


THIS PROCEDURE HELPS YOU LOCATE THE PAGING DEVICE ADDRESS, AND TO RESTART THE SYSTEM 
IN SOME WAIT STATE SITUATIONS 


The address of the error information area is found at location X’40C’ in main storage. 
This procedure should be used for the following wait state: 

WAIT06F
Procedure 


1. At the system console, invoke the ALTER/DISPLAY frame with:


‘F ALTCP’ 

2. When the ALTCP frame is displayed, specify the number of a CP that is not in the Load or disabled 
wait state. 

3. Enter: 


‘A2 B3’ (Display Primary Virtual Storage) 
4. Enter: 
‘40C’ at ‘Address(hex) = >’


5. Print the frame or write down the contents of location 40C (‘aaaaaaaa’), which is the address of the 
error information area. 


6. Enter: 
‘aaaaaaaa’ (the contents of location X’40C’ obtained above) at ‘Address(hex) = >’
This gives you the contents of the error information area:


• Offset X’0-1’ contains the channel path identifier
• Offset X’2-3’ contains the device number


7. Verify that the device is owned by the system.
8. Press the STOP key.
9. Enter:
‘A1 B3’ (Alter Primary Virtual Storage)
10. Enter:
‘30E’ at ‘Address(hex) = >’


11. Move the cursor to location X’30E’ and enter one of the following recovery codes:


00 = Retry to access the device without any recovery. If the problem persists,
     the wait state code ‘06F’ is re-issued.

01 = Recover access to the device through an alternate path. Because of data integrity exposure,
     quiesce any other system that has access to this device BEFORE entering X’01’.

02 = Force the device offline.


12. After entering the above, restart the waiting CP by entering:


‘RESTART CPx’
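
As a memory aid, the three recovery codes of step 11 can be held in a small lookup table. This is only a sketch for an installation's own runbook tooling; the names are hypothetical and nothing here communicates with the processor complex.

```python
# Recovery codes entered at location X'30E' for WAIT06F (step 11 of PD06).
RECOVERY_CODES = {
    0x00: "Retry access to the device without any recovery",
    0x01: "Recover access to the device through an alternate path",
    0x02: "Force the device offline",
}


def describe_recovery_code(code: int) -> str:
    """Return the meaning of a PD06 recovery code, rejecting undefined values."""
    if code not in RECOVERY_CODES:
        raise ValueError(f"X'{code:02X}' is not a defined recovery code")
    return RECOVERY_CODES[code]


def requires_quiesce_of_sharing_systems(code: int) -> bool:
    """Only code X'01' carries the data integrity exposure noted in step 11."""
    return code == 0x01


print(describe_recovery_code(0x01))
print(requires_quiesce_of_sharing_systems(0x01))  # prints True
```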
PD07


THIS PROCEDURE HELPS YOU LOCATE THE PAGING DEVICE ADDRESS, AND TO RESTART THE SYSTEM
IN SOME WAIT STATE SITUATIONS


The device number of the failing paging device is found at location X’40C’ in main storage. 
This procedure should be used for the following wait state: 

WAIT116 
Procedure 


1. At the system console, invoke the ALTER/DISPLAY frame with: 


‘F ALTCP’ 

2. When the ALTCP frame is displayed, specify the number of a CP that is not in the Load or disabled 
wait state.

3. Enter: 


‘A2 B3’ (Display Primary Virtual Storage) 
4. Enter: 
‘40C’ at ‘Address(hex) =>’ 


5. Print the frame or write down the contents of location X’40C’ (‘aaaaaaaa’), which gives the device 
number of the failing paging device. 


6. Verify that the device is ready and restart the waiting CP with: 
‘RESTART CPx’
PD08


THIS PROCEDURE HELPS YOU LOCATE THE REASON CODE SUPPLIED FOR A WAIT202
The reason code is found at location X’40C’ in main storage.
This procedure should be used for the following wait state:
WAIT202
Procedure 


1. At the system console, invoke the ALTER/DISPLAY frame with: 


‘F ALTCP’ 

2. When the ALTCP frame is displayed, specify the number of a CP that is not in the Load or disabled 
wait state. 

3. Enter: 


‘A2 B3’ (Display Primary Virtual Storage) 
4. Enter: 
‘40C’ at ‘Address(hex) = >’


5. Print the frame or write down the contents of location X’40C’ (‘0000800x’ or ‘0000fccc’). Report this
information to the system programmer.
PD09


THIS PROCEDURE HELPS YOU LOCATE THE CAUSE OF AN I/O ERROR FOR A WAIT006
This procedure should be used for the following wait state:
WAIT006
A WAIT006 may indicate a hardware problem on a CHPID used to access the SYSRES device during IPL.


The following messages may be issued at the hardware system console: 


LOAD failed. Interface control check. (35712) 
LOAD failed. Channel or device status not valid (35710) 


Procedure 

1. At the system console, invoke the IOPD frame: ‘F IOPD’

2. When frame IOPD-00 is displayed, select option ‘A5’ (Device Configuration).

3. Enter the device number of the Load address (SYSRES device). 

4. Note the installed channel paths to the device. 

5. At the service console, display the Interface Control Check Index frame (Figure 17):


‘F IFCC’


07 NOV 88 13:39:32
                 INTERFACE CONTROL CHECK INDEX <IFCC>

A= SELECT LOG (1-50)

    1. 01 NOV 88 13:36:17.18 CHPID62
    2. 01 NOV 88 13:35:13.79 CHPID62      B= DISPLAY
    3. 29 OCT 88 00:03:24.13 CHPID2B         X1. CHPID LOG
    4. 28 OCT 88 02:16:22.62 CHPID2B         X2. INTERFACE TRACE
    5. 27 OCT 88 22:50:59.01 CHPID2B         X3. CSAR TRACE
    6. 27 OCT 88 04:56:47.34 CHPID62
    7. 27 OCT 88 04:56:45.63 CHPID62
    8. 27 OCT 88 04:56:44.67 CHPID62
    9. 03 SEP 88 22:36:17.38 CHPID62
   10. 23 AUG 88 12:15:17.16 CHPID63      C= SELECT
   11. 23 AUG 88 12:15:17.57 CHPID63      -> 1. ALL CHPIDS
   12. 23 AUG 88 12:15:17.02 CHPID63         2. SPECIFIC CHPIDS
   13. 14 JUN 88 03:08:17.44 CHPID5C
   14. 14 JUN 88 03:08:17.35 CHPID5C
   15. 14 JUN 88 03:08:17.28 CHPID5C

MORE DATA ABOVE AND BELOW: PRESS BKWD OR FWD. (59885)

COMMAND ==>


Figure 17. IFCC Frame


The frame displays a log of the Interface Control Checks. 


If an IFCC is recorded for one of the CHPIDs to the SYSRES at the time of the WAIT006, at the system
console, enter the command:


‘CHPID cc OFF’ 
where ‘cc’ is the CHPID with the IFCC. 
Try to re-IPL without the failing CHPID. 
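
The check described above (is there an IFCC on a SYSRES path at about the time of the wait state?) can be sketched as follows. The sketch assumes log entries have been transcribed from the IFCC frame without their index numbers, that two-digit years are 19xx, and that a five-minute window counts as "at the time of" the WAIT006; the window and all names are hypothetical choices, not values from the frame.

```python
from datetime import datetime, timedelta

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def parse_ifcc_entry(entry):
    """Parse an entry such as '01 NOV 88 13:36:17.18 CHPID62'."""
    day, month, year, clock, chpid = entry.split()
    hours, minutes, seconds = clock.split(":")
    whole, _, hundredths = seconds.partition(".")
    when = datetime(1900 + int(year), MONTHS[month.upper()], int(day),
                    int(hours), int(minutes), int(whole),
                    int(hundredths or "0") * 10_000)
    return when, chpid.upper().replace("CHPID", "", 1)


def suspect_chpids(entries, sysres_chpids, wait_time,
                   window=timedelta(minutes=5)):
    """Return SYSRES CHPIDs that logged an IFCC within `window` of the wait state."""
    wanted = {c.upper() for c in sysres_chpids}
    hits = set()
    for entry in entries:
        when, chpid = parse_ifcc_entry(entry)
        if chpid in wanted and abs(wait_time - when) <= window:
            hits.add(chpid)
    return sorted(hits)


log = ["01 NOV 88 13:36:17.18 CHPID62", "29 OCT 88 00:03:24.13 CHPID2B"]
print(suspect_chpids(log, {"62", "2B"}, datetime(1988, 11, 1, 13, 39, 32)))
# prints ['62']
```

Each CHPID returned is a candidate for the ‘CHPID cc OFF’ command before the IPL is retried.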
PD10


THIS PROCEDURE HELPS YOU IDENTIFY THE CAUSE OF A MACHINE CHECK FOR A WAIT064-4
This procedure should be used for the following wait state:
WAIT064-4 (PSW = 000A0000 00040064)


A WAIT064-4 indicates a Machine Check interrupt was received during NIP processing. One possible
cause is that an I/O interrupt was received from a device that has not been defined in the IOCP. In this
case, the Channel Subsystem cannot present an I/O interrupt since there is no subchannel associated with
the device, and so it presents a Machine Check interrupt.


Another cause of WAIT064-4 is changing the state of a resource during the MVS IPL. Do not do the
following during an MVS IPL:


• If running under a VM/XA host, do not attach, detach or define any I/O device during the MVS IPL.


• If running in LPAR mode, do not configure online or offline any CHPIDs to the logical partition in which
MVS is being IPLed.


If these occur during NIP, a WAIT064-4 is loaded.
Procedure 
1. At the system console, press the VIEWLOG key, and look for a message of the form: 


INTERRUPTION FROM DEVICE NOT IN IOCDS. CHPID=6C, UA=E4. (25197) 


2. Check that the correct IOCDS is in use. 


3. Report this problem and the full text of the message to the CE and the System Programmer. 
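
The PSW quoted for WAIT064-4 shows how a disabled wait PSW carries both the wait state code and a reason code. The sketch below splits the two; it assumes, based on the example PSW and on the "bits 32-47" wording in the wait state tables, that the reason code is the upper halfword of the second PSW word and the wait state code is its low-order 12 bits. The function name is hypothetical.

```python
def decode_wait_psw(psw: str):
    """Return (wait state code, reason code) from a PSW such as '000A0000 00040064'."""
    words = psw.split()
    if len(words) != 2 or any(len(w) != 8 for w in words):
        raise ValueError("expected two 8-digit hexadecimal words")
    second = int(words[1], 16)
    reason_code = second >> 16       # PSW bits 32-47
    wait_code = second & 0xFFF       # assumed: low-order 12 bits of the PSW
    return f"{wait_code:03X}", f"{reason_code:04X}"


print(decode_wait_psw("000A0000 00040064"))  # prints ('064', '0004')
```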
PD11


THIS PROCEDURE HELPS YOU RESOLVE WAIT007 AND WAIT021


This procedure should be used for the following wait states:


WAIT007 (PSW = 000A0000 00000007)
WAIT021 (PSW = 000A0000 00000021)


A WAIT007 indicates that no console was available during NIP.


A WAIT021 indicates that an I/O error occurred on the main console following an I/O operation.


Procedure
1. Using the installation console configuration diagram, locate the master and first alternate consoles,
   and the control units.
2. WAIT021.
   Using the ALTCP frame on the hardware system console, for the target IPL processor, use the
   following commands:

      F ALTCP - select CP
      A2 B2
      address ‘B8’

   to display location X’B8’-X’BB’ to obtain the Subchannel ID of the last I/O interrupt (X’BA’-X’BB’ has
   the SID). This may indicate the subchannel number of the device where MVS wrote the ‘Specify System
   Parameters’ message.
   Use the IOPD frame, option A5, to determine the device number associated with this subchannel.
   Use the customer’s configuration diagrams to determine the device type.
   If the device number represents a VTAM CTCA, use the service language command ‘CHPID cc OFF’
   to vary the associated CHPID to the VTAM CTCA offline, and re-IPL.
3. Check the following: 


• Console security switch should be set in the correct position (if not, this is a typical case for a
  WAIT021).

• Consoles powered on - check the display power on indicator.
• Coax cables connected - use the test/normal switch and check the coaxial cable.
• No subchannel available:

  Use ‘F IOPD’ option A5 and the device number to determine whether the device is supported
  (ES/3090 Basic Mode), or supported and in the logical partition (ES/3090 LPAR Mode).

  Check whether the correct IOCDS is used when a subchannel is not available to support the
  device. Also check the IOCP input.

  For LPAR Mode only: When a device does not have a subchannel allocated to the target logical
  partition, the following message is displayed on the system console.

     REQUESTED DATA NOT DEFINED FOR LOGICAL PARTITION xxxx (60761)

• Channels available

  Use IOPD frame option A5 and the device number to determine the paths defined to that device
  and use the CHNCFA frame to determine the state of the CHPIDs.

• Control unit powered on - check the control unit power indicators.
• Control unit online - check the control unit online/offline switch.
• Control unit IMLed - re-IML the control unit.
• Console UCB online - check for any I/O generation changes.
• Check for ‘coax patch panel’ changes.
4.0 Enabled Wait States


An enabled wait state exists when MVS finds no work to dispatch in the system. This may be a normal 
state, when the system is idle for example, or it may be a symptom of a problem. 


Typically, line 24 of the system console will appear as shown in Figure 18 during an enabled wait state. 


[Console line 24 display; legible fields: ITSC-POK, PSW1, Operating]


Figure 18. Hardware System Console Line 24 


Prior to placing a processor into a disabled wait state, MVS has detected a problem. MVS loads a coded, 
disabled wait state PSW to indicate the cause of the problem to the operator. Refer to “Disabled Wait 
States” on page 7. In an enabled wait state, MVS is not aware of a problem, even though one may exist, 
and the PSW has the form: 


PSW = 070E0000 00000000


This is called the ‘no-work wait’ or ‘dummy wait’, because MVS loads this PSW when there is no work to
dispatch in the system.
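
The difference between the no-work wait PSW and a disabled wait PSW lies in a few mask bits of the first PSW word. The following sketch classifies a PSW on that basis. It assumes the ESA/370 PSW layout (bit 6 is the I/O mask, bit 7 the external mask, bit 14 the wait bit) and simplifies "enabled" to mean that either of those two masks is on; the function name is hypothetical.

```python
IO_MASK = 0x02000000        # PSW bit 6
EXTERNAL_MASK = 0x01000000  # PSW bit 7
WAIT_BIT = 0x00020000       # PSW bit 14


def classify_psw(psw: str) -> str:
    """Classify a PSW as 'enabled wait', 'disabled wait' or 'not waiting'."""
    first_word = int(psw.split()[0], 16)
    if not first_word & WAIT_BIT:
        return "not waiting"
    if first_word & (IO_MASK | EXTERNAL_MASK):
        return "enabled wait"
    return "disabled wait"


print(classify_psw("070E0000 00000000"))  # prints enabled wait
print(classify_psw("000A0000 00040064"))  # prints disabled wait
```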


One of the characteristics of an enabled wait state is that communication with the operating system is still 
possible through the master console. 


Check at the system console to determine whether the system is really in a wait or if a high priority job is 
looping. For that purpose, one of the SAD frames should be set up to display the utilization of all CPUs. 


Also, check for outstanding replies to messages. Enabled wait states usually indicate that the system is 
waiting for: 


• Work

  A problem in a subsystem - for example, JES2 - may prevent new work from starting. If MVS/ESA
  appears to be responding to commands, the subsystems should be investigated.

• Operator action or response

  An outstanding operator response may cause a bottleneck.

• Missing interrupt

  If a paging device has a missing interrupt condition, the operator may not be made aware of the
  problem because some system communications routines are pageable. In this case, the system may
  enter an enabled wait state.

• A system resource

  When a lengthy, unexplained enabled wait state occurs, and you believe there is work for the system
  to do, the availability of system resources should be checked.

  - Enqueue Lockout

    Critical system resources may be enqueued and not released. The components not freeing the
    resources should be investigated. GRS and RMF can be used to display resource contention.

  - Missing Resources

    Critical system resources may have been removed during previous error recovery. For example,
    during Unconditional Reserve recovery a critical DASD path may have been taken offline.

    CONFIG members should be set up to reflect all critical system resources:

    - Processors
    - Storage - expanded and central
    - CHPIDs
    - Critical DASD devices and paths


It can be difficult to determine the cause of an enabled wait state, since the problem may not be imme- 
diately visible to the operator. There are, however, some things you can do to help determine the cause 
of a wait. 


Procedures 
Two procedures are provided:
1. PROCEDURE 1: For a system already IPLed.
2. PROCEDURE 2: For an enabled wait state during IPL operations.
Procedure 1 - System already IPLed 
1. Issue: 
‘D R,L’ 


to check for outstanding operator action. Ensure all DDR swap requests are resolved, and all requests 
to bring offline devices online (outstanding message IEF238D) are satisfied. 


2. Issue: 
‘D M=CONFIG(xx)’ 
to check for any missing critical resources. 
3. Issue: 
‘D A,L’
to see if there is ready work waiting to execute. 


Look for a job step name of STARTING, which indicates that the system has not successfully com- 
pleted initiation of the step. 


4. Issue: 

‘D U’

to see whether there is any critical device busy (BSY), mount pending (MTP), or not ready (NRD). 
5. Issue: 

‘D GRS,C’ 

to display resource contention. 


6. Using RMFMON, issue ‘SENQ’ and ‘SENQR’ to see whether there is enqueue contention. You can also
use RMF Monitor III for that purpose (ENQ, ENQR, or ENQJ options).


7. If possible, check through the SYSLOG for any indications of a problem: 


• Storage shortage messages
• Subsystem resource shortages (VTAM buffers, or JES2 spool, for example)
• I/O error messages (indicating loss of devices, paths, CHPIDs)


8. At the system console, invoke the SAD frame to display CPU utilization. If available, display processor
   utilization by storage key to determine whether a specific subsystem is causing the problem.


9. At the system console, display the SAD frame showing ‘HI’ CHPID utilization.


10. At the system console, display the system console log by pressing the VIEWLOG key. Scroll through
    the log looking for error messages, or other indications of a problem. Try to scroll back to when the
    system was last IPLed, or to a time when the system was last known to be running successfully, that
    is, performing normal work.


11. If, after doing the above, you are still unable to determine the cause of the wait state, take a dump of
    the master scheduler by issuing:

    ‘DUMP COMM=(dump title)’

    Reply ‘U’ to message IEE094D.

    This reply causes the SDATA default options to be taken for the Master Scheduler address space.


12. After the dump has completed, use the restart facility to invoke MVS/ESA system diagnostics. Issue:

    ‘F SYSCTL’

    at the system console.

    Enter ‘C1’ and reason code ‘1’.

    When the Restart function with Reason 1 is invoked from the SYSCTL frame on the system console,
    MVS/ESA checks:

    a. The Missing Interrupt Handler message queue and, if missing interrupt conditions exist for paging
       devices, the system notifies the operator through message:

       IOS116A MIH CONDITION PENDING ON PAGING DEVICE ddd

    b. The system non-dispatchability indicator.
    c. The WTO buffer usage.


13. If all the above fails, take a stand-alone dump and re-IPL the system.


Procedure 2 - Enabled Wait During IPL 


MVS/ESA may be waiting for the operator’s response to message:


IEA101A SPECIFY SYSTEM PARAMETERS


This situation occurs when this message is issued on a device that is not located near the operator.


Use the procedure in “PD01” on page 20, steps 1 through 9, to help you locate the device where the
message may have been issued.


After determining the device number from the procedure in “PD01” on page 20, use the installation
configuration diagrams to determine the device type of the device that presented the last I/O interrupt.
If the device number represents a VTAM CTCA, use the hardware system console command:


CHPID cc OFF


to vary the associated CHPID to the VTAM CTCA offline. This will allow you to re-IPL without interference
from the VTAM CTC and enable you to attempt to locate the IPL/NIP console being used.


This is a common IPL problem. When the master console that the operator expects to be used is found 
to be unavailable by MVS, MVS selects another console. 




5.0 IPL


IPL time is a very sensitive period. Problems encountered at IPL time may be more difficult to diagnose,
since the full recovery facilities of MVS are not yet available.


In this guide, we deal with these problems in three groups:


1. Wait States


When a problem occurs and the system enters a wait state.


Disabled Wait States and Enabled Wait States are discussed in the previous chapters, and procedures
are provided to help to recover from some of them.


2. Messages


When a problem occurs and a message is issued at the console. In this chapter, the following
messages are described:


IOS120A - DASD contention during pathing

IEA120A - DASD contention during reading VOLSER
IEA212A - duplicate VOLSER detection

IEA317A - unable to locate required data set


IOS000I - I/O error


3. No Message or Wait State


No wait state or console messages are provided for guidance when a problem occurs. This means
that some hardware may be involved.


Using Figure 19, try to determine at what point during the IPL process the problem occurs.
Figure 19. IPL Flow Diagram


If the IPL problem occurs between the time the IPL function is invoked and the time the message ‘Specify
System Parameters’ is displayed, then one approach is to remove everything from the configuration except
the essential IPL DASD and console elements, which are:


• The system residence volume
• One channel path to the IPL volume
• One console
• A channel path to the console


If all else fails, configure all channel paths other than those required for the console and the system resi- 
dence volume offline, and retry the IPL. 


IOS120A - Shared DASD (DASD Contention)


During DASD path verification (pathing), a device may be ‘busy’ (actively working) with another system
at the time MVS attempts the pathing I/O operation to it.


The pathing I/O operation is timed (since the Missing Interrupt Handler is not initialized at this stage in the
IPL), and if after 1.5 seconds the I/O operation has not completed, the following message is issued:


IOS120A DEVICE ddd SHARED. REPLY 'CONT' OR 'WAIT'.


If the operator replies ‘CONT’ to message IOS120A, the device or path is placed offline by MVS.


If the operator replies ‘WAIT’ to message IOS120A, the IPL process will not proceed until the I/O interrupt
for the pathing operation is received from the device.


If the same contention is experienced during DAVV (Direct Access Volume Verification) processing, where
MVS reads the volume label from the DASD device, the following message is issued:


IEA120A DEVICE ddd VOLID NOT READ. REPLY 'CONT' OR 'WAIT'.
If the pathing I/O operation times out, but the device is not generated as ‘SHARED’ or ‘SHAREDUP’, then
the operator is not notified, and the device will be placed offline.


Summary


                         Device Busy on         Device Not Busy
                         Sharing System         on Sharing System

PATHING OPERATION

Generated as shared      IOS120A message        OK - device/path online

Generated non-shared     Continue without       OK - device/path online
                         device

READ VOLUME LABEL

Generated as shared      IEA120A message        OK - device/path online

Generated non-shared     Continue without       OK - device/path online
                         device
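
The summary table reduces to a small decision function, sketched below. The operation names, the function name and the returned strings are hypothetical labels chosen for illustration; only the outcomes themselves come from the table.

```python
def ipl_device_outcome(operation: str, generated_shared: bool,
                       busy_on_sharing_system: bool) -> str:
    """Return the outcome from the summary table for one device during IPL.

    operation is 'pathing' or 'read_label' (hypothetical labels for the
    PATHING OPERATION and READ VOLUME LABEL halves of the table).
    """
    messages = {"pathing": "IOS120A", "read_label": "IEA120A"}
    if operation not in messages:
        raise ValueError("operation must be 'pathing' or 'read_label'")
    if not busy_on_sharing_system:
        return "device/path online"
    if generated_shared:
        return f"{messages[operation]} message issued"
    return "continue without device"


print(ipl_device_outcome("pathing", True, True))      # prints IOS120A message issued
print(ipl_device_outcome("read_label", False, True))  # prints continue without device
```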


The IPLing system may find the device busy as a result of:


• Contention

  The device may be in use by a sharing system at the time this system is being IPLed.

• Outstanding Error Recovery

  The device may have a ‘stuck allegiance’ as a result of the answer to message IOS427A that occurred
  during a previous IPL session (on this system or on any other sharing system), to which the operator
  replied BOX or NOOP.

• Current Errors

  If a problem exists on an interface, an I/O operation can take up to 16 seconds to time out. However,
  NIP only waits 1.5 seconds, and so will time out first, treating the condition as a busy device.

• Bad IOCDS

  If this IPL uses a new IOCDS, incorrect definitions relating to the device may cause IOS120A messages
  to be displayed.


If the device related to message IOS120A is not needed for a successful IPL operation, reply CONT to the
message. Otherwise, the recovery (if possible) depends on the cause:


• Contention

  Go to the other systems and determine the status of the device involved by entering the D U command.

• Error Recovery

  The operator should not have replied BOX or NOOP to message IOS427A. Refer to “Unconditional
  Reserve” on page 97 for a description of the IOS427A message.

  Contact your hardware service personnel.

• Current Errors

  Use the IFCC frame at the service console. Determine if any IFCC has occurred on the CHPID to the
  device related to the IOS120A message. If there are two or more paths to the device, use the service
  language command:

     CHPID cc OFF

  to remove the affected CHPID from the configuration; then re-IPL.

• Suspected Bad IOCDS

  Repeat the IPL. If the same problem occurs, there appears to be no other reason for contention,
  and a new IOCDS is in use, try to IPL using the old IOCDS. If the old IOCDS works, call your
  system and hardware support to investigate. Your old IOCDS may not support the devices getting the
  IOS120A messages.

IEA212A - Duplicate Volumes 

After DASD pathing is complete, MVS performs DAVV (Direct Access Volume Verification) processing,
which includes reading the volume label of each online DASD device and updating the UCB for each
device with the corresponding volume label information.


If more than one device is found to have the same volume label, the following message is issued: 


IEA212A DUPLICATE VOLUME volser D, xxx or yyy REPLY DEVICE ADDR 


If a second device is found to have the same volume label as the system residence volume, the following 
message is issued: 


IEA212A DUPLICATE SYSRES volser D, xxx REPLY DEVICE ADDR 


In both cases, the operator must specify a volume to be dismounted before the IPL process can proceed. 


• When the latter condition is detected, message IEA212A asks the operator to dismount the volume that
is not the IPL volume. (You do not have a choice of which volume to dismount.)

Reply to message IEA212A:

R 0,ddd

The following message should be issued:

IEA313I DEVICE ddd DISMOUNTED

• When a duplicate volume condition is detected, the operator can choose which volume to dismount.

Refer to the 'required volumes' list of the installation to determine which of the duplicate volumes
should be dismounted.

Reply to message IEA212A:

R 0,ddd

The following message should be issued:

IEA313I DEVICE ddd DISMOUNTED


IEA317A - Unable to Find a Data Set


During the initialization process, the required system data sets are located using the volume information 
in the catalog. If a volume with the VOLID listed in the catalog for a particular data set was not found, and 
that data set is required during NIP, then the following message may be issued: 


IEA317A SPECIFY UNIT FOR ssss.yyyyyyyy ON vvvvvv OR CANCEL

This message indicates that when MVS attempted to locate data set 'ssss.yyyyyyyy' using the volume
pointer in the master catalog, no volume 'vvvvvv' had been read during DAVV.

Refer to your installation's volume-id to device-number cross-reference list to determine which
device number is labelled 'vvvvvv'.

Respond to message IEA317A with the device number (xxx) using:


R 00,xxx 


After entering the response, the following message might be issued: 


IEA318I UNIT UNACCEPTABLE 
IEA317A SPECIFY UNIT FOR ssss.yyyyyyyy ON vvvvvv OR CANCEL 


This message indicates that MVS has not read a volume label of vvvvvv on device number xxx. 


Determine the problem by checking access to the device:

• Is there at least one online CHPID to the required device?

Find the CHPIDs defined for the device using the IOPD frame option A5d.
Check the CHPID availability for the current physical partition using the CHNCFA frame.

If a subchannel is not available to support the device, check your installation information to
confirm that the correct IOCDS has been selected.

In LPAR mode, use the IOPD frame option Ad and the device number to determine whether the device
is supported and in the logical partition.

When a device does not have a subchannel allocated to the target logical partition, you will see
the message:

REQUESTED DATA NOT DEFINED FOR LOGICAL PARTITION xxxx (60761)

Determine the channels associated with the device from the IOCP input or the installation
configuration diagrams, and use the LPCHNA frame to determine which logical partition currently owns
the CHPIDs. A logical partition must have at least one CHPID associated with the device online
before a subchannel for the device can be allocated to that logical partition.

• Is the control unit path operational?

Check that the control unit is powered on.

Check that the control unit is enabled to the interface.

If the control unit interface path is switched through a 3814, check the switching unit settings.

• For 3380 DLS or DLSE devices, check whether the device 'Enable/Disable Switch' is in the 'Enable'
position.

• Is the device in a 'Ready' state?

• Was a previous IOS120A message for this device number replied to with a response of 'CONT'?
If a response of 'CONT' was used and the volume is required, you will have to re-IPL.


This problem can also be caused by replying incorrectly to the following message:

IEA347A SPECIFY MASTER CATALOG

Confirm with the installation's system or technical support staff what the correct master catalog
reply should be.

IOS000I - SYSRES in Read/Only Mode

MVS does not attempt to write to the system residence volume until towards the end of the IPL
process. All I/O up to that point involves only reading the system residence volume.

MVS attempts to write to SYS1.LOGREC and issues message IOS000I when SYS1.LOGREC resides on the
system residence volume and that drive is in read-only mode. This problem can occur on devices that
support physical Read/Write switches.

The following message is issued:

IOS000I ddd,1D,WRI,cc,0E40,

This indicates that it is not possible to write to the device. Check the Read/Write switch setting for
those devices that provide Read/Write switches as an operator control.

6.0 Loops


A loop is an endless execution of a sequence of instructions. Loops tie up the system and can prevent the 
execution of other tasks. The operator can stop the loop by cancelling the looping job. 


Typically, line 24 of the system console will appear as shown in Figure 20 during an enabled wait state. 


ITSC-POK   . . .   PSW1 Operating


Figure 20. Hardware System Console Line 24


One characteristic of a loop is that the wait indicator is off (observe the status line on the system console).

If the operator console is locked out, and at least one CP stays at 100% utilization on the SAD frame, the
loop is probably a disabled loop. On a system operating in LPAR mode, the SAD frame will reflect activity
greater than the assigned activity weight for the normal value of the processors. See "Disabled Loop" on
page 46 for further information on disabled loops.


If MVS accepts commands from the operator console, then the loop is probably an enabled loop. The SAD 
frame may show higher than normal CP utilization, but probably not 100% and probably not confined to 
one CP. See “Enabled Loop”, below, for further information. 


Enabled Loop 


During an enabled loop, communication with MVS through the operator console is still possible. (With a
disabled loop, the operator console is locked out.)

If an application program is looping, the job's specified TIME parameter should cause the system to cancel
it after an appropriate interval. If a subsystem (VTAM, JES2, IMS, or CICS, for example) is looping, use the
appropriate procedures for those components.


Procedure 
Use the following procedure to recover from an enabled loop: 
1. Identify the looping job or component as follows: 


a. If TSO is active, use RMFMON ARD or ASRM to identify a job that is using CPU time heavily
with little or no I/O activity.

b. For systems with SDSF installed, use the SDSF 'DA' (Display Active) option and observe what CPU
percentage is being used by each job.

The value shown represents a percentage of the total processor complex power. Therefore, if a
job running on an IBM ES/3090 600S is using the power of one processor, SDSF shows 16%
CPU utilization. The same job would show 25% on an IBM ES/3090 400S and 50% on an IBM
ES/3090 200S. So, based on the processor complex you have, investigate any case
where the percentage shown represents approximately the power of one processor.

The approximate percentage shown by SDSF if the job is using the power of one processor
(systems in SI mode) is shown below:

ES/3090-600S   16%
ES/3090-500S   20%
ES/3090-400S   25%
ES/3090-300S   33%
ES/3090-280S   50%
ES/3090-200S   50%

The values shown for uniprocessor models (100S, 120S, 150S, and so on) may be anywhere
between a few percent and 99%, but tend to be above 90% most of the time when the job is in a loop.


c. At the system console, invoke the SAD frame displaying CPU utilization by storage key. The
various subsystems use different storage keys. For example, a loop in VTAM may be identified
if the SAD frame shows a high processor utilization in storage key 6.

d. If you cannot use RMFMON, enter 'D A,A' at regular intervals. Compare the field labelled CT
(CPU Time) for all the jobs in the system. A job that is looping shows large changes in that
field, perhaps as much as the time interval you use between display commands. This technique
is not practical when there are a large number of address spaces in the system; therefore, it is
recommended that you always have access to a monitor such as RMF.

2. When reasonably sure that a job is looping, enter:

CANCEL jobname,DUMP

to terminate the job.

If the cancel does not work, enter:

FORCE jobname,ARM

to terminate the job.

Note: If it is not possible to recognize the looping job using the methods listed above, or if the CANCEL
or FORCE command does not resolve the loop, continue with the next step to invoke the Restart facility
at the system console.


3. Invoke the SYSCTL frame (Figure 21) at the system console.

If operating in LPAR mode, from the system console, enter:

SETLP lpname

where 'lpname' is the logical partition name.

4. Stop the looping processor (identified on the SAD frame) by entering the service language command
'STOP CPn'.

Note: It may not always be possible to identify the processor that a job is looping in (this also applies
when operating in LPAR mode and the loop is in a 'shared' logical partition). Because the loop is
enabled, the job may be interrupted and re-dispatched on any processor at any time.

5. Identify the looping CP as the target by entering 'Tn', where 'n' is the CP number. The loop may be
dispatched on a different processor after the SAD frame display is terminated. Therefore, use the
indicators on the system status line to verify the target processor.

6. Select option 'C1' on the SYSCTL frame to invoke the restart facility.

7. When prompted, select restart reason '0'. The following message is issued:

IEA500A RESTART INTERRUPT DURING jobname stepname
ASID=aaaa MODE=mmmm PSW=xxxxxxxx xxxxxxxx
REPLY RESUME TO RESUME INTERRUPTED PROGRAM
OR ABEND TO ABEND INTERRUPTED PROGRAM

This message identifies the task that was active at the time the restart was invoked. However, there
is no guarantee that the task identified in the message is the one in the loop. If you suspect this is
not the looping job, reply 'RESUME' and invoke restart once more. Eventually, it will be evident which
job is looping.

A reply of 'ABEND' terminates the task with a completion code of '071'.

Under some circumstances, message IEA500A may not be issued and the current task may be
abended with completion code 071 immediately.

SCP MANUAL CONTROL (ESA/370 MODE)                          (SYSCTL)

A= INITIALIZE SYSTEM CONTROL PROGRAM       T= TARGET CP
   1. LOAD UNIT ADDR: ....                    X0. CP0   => X3. CP3
   2. LOAD PARM(A/N) : .........              X1. CP1      X4. CP4
   3. INITIATE SCP INITIALIZATION             X2. CP2      X5. CP5

B= INITIALIZE STANDALONE DUMP              R= RATE CONTROL
   AUTO STORE STATUS = ON                  -> 1. PROCESS
   1. LOAD UNIT ADDR: ....                    2. I-STEP
   2. INITIATE STANDALONE DUMP
                                           RESTART REASONS
                                           0 - ABEND CURRENT PROGRAM
C= RESTART                                 1 - PERFORM MVS SYSTEM DIAGNOSTICS
-> 1. INITIATE RESTART
      REASON(A/N) : 0

D= INSTRUCTION ADDRESS TRACE
   1. START ADDRESS TRACING

15:09:28 CP3 RESTARTED

COMMAND ==>

                                                      PSW3 OPERATING


Figure 21. SYSCTL Frame
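
The two rules of thumb used in step 1 of the enabled loop procedure (the share of total complex power that one busy CP represents, and the growth of the CT field between two displays) can be sketched in a few lines of Python. This is only an illustration: the function names, the dictionary layout of the sampled CPU times, and the 0.8 threshold are assumptions, not part of SDSF, RMF, or MVS.

```python
from typing import Dict, List


def one_cp_share(online_cps: int) -> float:
    """Percentage of total complex power represented by one fully busy CP."""
    if online_cps < 1:
        raise ValueError("online_cps must be at least 1")
    return 100.0 / online_cps


def looping_candidates(before: Dict[str, float], after: Dict[str, float],
                       interval_seconds: float, threshold: float = 0.8) -> List[str]:
    """Return jobs whose CPU time grew by most of the sampling interval.

    `before` and `after` map job names to CPU seconds noted from two
    displays taken `interval_seconds` apart. The threshold (fraction of
    the interval) is an assumed value chosen for this sketch.
    """
    suspects = []
    for job, cpu_after in after.items():
        if job not in before:
            continue  # job started between the two samples; no delta available
        if cpu_after - before[job] >= threshold * interval_seconds:
            suspects.append(job)
    return sorted(suspects)
```

For a six-way complex, `one_cp_share(6)` gives roughly 16.7, which corresponds to the 16% entry for the 600S in the table in step 1b.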


Disabled Loop 


If one processor is looping in a disabled state, it is not able to respond to a SIGP from another
processor. In a non-uniprocessor environment, a disabled loop on one CP usually results in a Spin Loop
Timeout. Recovery from a spin loop is described in "Spin Loop" on page 47.


Procedure 


If a disabled loop occurs without spin loop detection (for example, in a single CP environment), the
following procedure can be used to recover:

1. Invoke Instruction Address Trace, if required (not supported in LPAR mode).

If the cause of the problem is not already known, the Instruction Trace (also known as 'Loop
Recording') facility of the processor controller may be used to gather a trace of the looping
instructions before proceeding with recovery actions.

Instruction trace is non-destructive and does not impair the chances of recovery afterwards, so it is
recommended as a valuable tool for subsequent problem diagnosis. Refer to "Instruction Trace" on
page 91.

Note: The instruction trace takes approximately 90 seconds per online CP.

2. Invoke the SYSCTL frame (Figure 21) at the system console.

If operating in LPAR mode, at the system console, enter:

SETLP lpname

where 'lpname' is the logical partition name.
3. Select option 'C1' on the SYSCTL frame to invoke the restart facility.

4. When prompted, select restart reason '0'.

As a result, the following message appears at the MVS console:

IEA500A RESTART INTERRUPT DURING jobname stepname
ASID=aaaa MODE=mmmm PSW=xxxxxxxx xxxxxxxx
REPLY RESUME TO RESUME INTERRUPTED PROGRAM
OR ABEND TO ABEND INTERRUPTED PROGRAM

5. At the MVS console, reply 'ABEND' to the message.

There is no point in replying 'RESUME', since the loop will continue indefinitely. Since the loop is
disabled, no task other than the looping task can be dispatched on the CP.

A reply of 'ABEND' terminates the looping task with a completion code of 071.

Note: Under some circumstances, the message may not be issued and the current task may be
abended with completion code 071 immediately.

6. After the looping task has been terminated, normal operation is resumed. If the Instruction Address
Trace facility was used to trace the loop, issue a 'DUMP' command at the MVS console so that the loop trace
data is captured in a dump data set for later processing by the System Programmer. (The loop
recording data is not retrieved for the Abend 071 dump.)


Spin Loop 


A spin loop occurs when one processor in a multiprocessor environment is unable to communicate with
another processor, or requires a resource currently held by another processor. The processor that has
attempted communication is the 'detecting' or 'spinning' processor. The processor that has failed to
respond is the 'disabled' or the 'failing' processor.

When communication is not successful within a given time, an excessive spin loop timeout condition exists.
The detecting processor then initiates recovery processing for the condition.

MVS processing for excessive spin loop conditions provides recovery without any interaction with the
operator. The default order in which the system takes action is SPIN, ABEND, TERM, and ACR. If the same
spin loop recurs, the system takes the next action. When a particular excessive spin loop has been
resolved, any new spin loop causes the sequence of automatic recovery actions to start with SPIN and
proceed through the sequence again. An installation can change the order of the actions that the
system takes, except the first one.


• SPIN - continues spinning for another time interval.

• ABEND - terminates the current unit of work on the failing CP but allows the recovery routines to retry.

• TERM - terminates the current unit of work on the failing CP and does not allow the recovery routines
to retry.

• ACR - invokes ACR to take the failing CP offline.

• OPER - issues message IEE331A via DCCF, and processes the operator reply.

When the system initiates any of the default or specified recovery actions, it issues message IEE178I to
inform the operator. This message is strictly for information and the operator need not take any action.

In a normal environment, the IBM-supplied defaults for the system are sufficient. In a test environment,
however, you might want to specify that other actions be taken.


Member EXSPATxx 


Member EXSPATxx of SYS1.PARMLIB allows you to specify the action or actions to be taken, as well as
the spin time duration, if an excessive spin is detected for one of the following:

• RISGNL RESPONSE
• LOCK RELEASE
• SUCCESSFUL BIND BREAK
• RESTART RESOURCE
• ADDRESS SPACE QUIESCE
• INTERSECT RELEASE

Spin loops caused by a SIGP failure are not covered by the actions specified in the EXSPATxx member. For
hardware-related errors that formerly caused message IEA490A, the system immediately initiates ACR
processing.

If the cause of a persistent excessive spin is not resolved by the sequence of recovery actions, the system
puts itself into a non-restartable '0A1' wait state. To avoid this possibility, both TERM and ACR should be
specified among the actions to be taken.
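
The escalation behaviour described in this section (each recurrence of the same spin loop takes the next action, a resolved loop resets the sequence, and an exhausted sequence ends in wait state 0A1) can be modelled with a small Python sketch. The class and method names are hypothetical and exist only for illustration; this is not how MVS implements the logic, and the sketch does not parse a real EXSPATxx member.

```python
from typing import List, Optional

# Default order of automatic recovery actions, as stated in the text.
DEFAULT_ACTIONS = ["SPIN", "ABEND", "TERM", "ACR"]


class SpinRecovery:
    """Toy model of the excessive spin loop recovery sequence."""

    def __init__(self, actions: Optional[List[str]] = None):
        self.actions = list(actions or DEFAULT_ACTIONS)
        # The text says the order can be changed, except for the first action.
        if self.actions[0] != "SPIN":
            raise ValueError("the first action must remain SPIN")
        self.position = 0

    def next_action(self) -> str:
        """Return the action taken for the next recurrence of the same spin loop."""
        if self.position >= len(self.actions):
            # Sequence exhausted without resolving the spin.
            return "WAIT 0A1"
        action = self.actions[self.position]
        self.position += 1
        return action

    def resolved(self) -> None:
        """A resolved spin loop makes any new spin loop start again with SPIN."""
        self.position = 0
```

With the default sequence, five consecutive calls to `next_action()` yield SPIN, ABEND, TERM, ACR and then the 0A1 wait state; calling `resolved()` at any point restarts the sequence at SPIN.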


Handling Message IEE331A 


If an installation wants the operator to control the recovery actions, it can specify OPER in the EXSPATxx
parmlib member. When the OPER action is reached in the recovery sequence, the system issues message
IEE331A. If a response to message IEE331A is not received after 125 seconds, the message is written to
the system console and the operator can respond through the SCPMSG frame. If the message cannot be
written to any console, MVS loads a restartable wait state (09x) for the spin loop situation.

Figure 22 shows the possible reasons for the spin loop and the recommended responses for handling the
various forms of message IEE331A.

IEE331A PROCESSOR (y) IS IN AN EXCESSIVE DISABLED SPIN LOOP
WAITING FOR (- msg insert -). REPLY U OR SPIN TO CONTINUE
SPIN, REPLY ABEND TO TERMINATE WORK ON PROCESSOR(x) WITH
RETRY, REPLY TERM TO TERMINATE WORK ON PROCESSOR(x) WITHOUT
RETRY, OR STOP PROCESSOR(x) AND REPLY ACR. (AFTER STOPPING
THE PROCESSOR, DO NOT START IT)

MSG INSERT               WAIT STATE   ACTION 1   ACTION 2   ACTION 3   ACTION 4
RISGNL RESPONSE          091          SPIN       ABEND      TERM       ACR
LOCK RELEASE             092          SPIN       ABEND      TERM       ACR
RESTART RESOURCE         N/A          SPIN       ABEND      TERM       ACR
ADDR. SPACE TO QUIESCE   095          SPIN       ABEND      TERM       ACR
CPU IN STOPPED STATE     096          START STOPPED CP and SPIN
INTERSECT RELEASE        097          SPIN       ABEND      TERM       ACR
OPERATOR INTERVENING     099          START STOPPED CP and SPIN
SUCCESS. BIND BREAK      09E          SPIN       ABEND      TERM       ACR


Figure 22. Actions for Message IEE331A and Wait State Codes
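
For installations that keep their run books in machine-readable form, the contents of Figure 22 can be held as a simple lookup table. The Python sketch below is one possible encoding; the dictionary name, the helper functions, and the convention that `attempt` counts replies starting at 1 are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

SPIN_SEQUENCE = ["SPIN", "ABEND", "TERM", "ACR"]
START_AND_SPIN = ["START STOPPED CP and SPIN"]

# Message insert -> (wait state code or None for N/A, recommended actions in order).
FIGURE_22: Dict[str, Tuple[Optional[str], List[str]]] = {
    "RISGNL RESPONSE": ("091", SPIN_SEQUENCE),
    "LOCK RELEASE": ("092", SPIN_SEQUENCE),
    "RESTART RESOURCE": (None, SPIN_SEQUENCE),
    "ADDR. SPACE TO QUIESCE": ("095", SPIN_SEQUENCE),
    "CPU IN STOPPED STATE": ("096", START_AND_SPIN),
    "INTERSECT RELEASE": ("097", SPIN_SEQUENCE),
    "OPERATOR INTERVENING": ("099", START_AND_SPIN),
    "SUCCESS. BIND BREAK": ("09E", SPIN_SEQUENCE),
}


def wait_state(msg_insert: str) -> Optional[str]:
    """Wait state code listed for this message insert, or None where the figure shows N/A."""
    return FIGURE_22[msg_insert][0]


def recommended_action(msg_insert: str, attempt: int) -> str:
    """Recommended response for the given attempt (1 = first time the message appears)."""
    if attempt < 1:
        raise ValueError("attempt counts from 1")
    actions = FIGURE_22[msg_insert][1]
    # Stay on the last listed action once the sequence has been used up.
    return actions[min(attempt, len(actions)) - 1]
```

Installation procedures take precedence over this table, as step 1 of the procedure that follows makes clear.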


Procedure 
To recover from a spin loop, proceed as follows: 


1. Respond to message IEE331A according to installation procedures. If none exist, use the table in 
Figure 22 to determine the response. 


2. Before replying 'ACR', at the system console enter:


STOP CPn 


where ‘n’ is the failing CP. 


If operating in LPAR mode, at the system console first enter: 


SETLP lpname 


where ‘Ipname’ is the logical partition name. This ensures that the logical processor to be stopped 
is in the correct logical partition. 


3. After replying to message IEE331A, restore normal console operations at the MVS console by per- 
forming the CANCEL action (ALT and PA2). 


4. Notify the System Programmer. Useful information about the environment at the time the excessive 
spin loop condition was detected is recorded in SYS1.LOGREC. 


5. If ACR is used in response to message IEE331A:

a. The active job on the processor in the spin loop is terminated with Abend 0F3.

b. The completion of ACR is indicated by one of the following messages:

IEA858E ACR COMPLETE CPU NOW OFFLINE
IEA858E ACR COMPLETE CPU NOW OFFLINE, PHYSICAL VARY FAILED

The text 'PHYSICAL VARY FAILED' indicates that the processor is logically offline, but still
physically online. This message may be issued if the processor was not taken offline because the
service processor was busy at the time the request was made. If you want to take the processor
offline physically, enter the 'CONFIG CPU(n),OFFLINE' command.

Note: In LPAR mode, it is not possible to configure the processor physically offline.

c. The processor that is configured offline may be configured back online, since there is most
probably no hardware failure associated with the spin loop reported in message IEE331A. The
cause was most likely software, and the failing job has been terminated with Abend 0F3.


Handling Wait State 09X 


When the spin loop timeout message cannot be issued at any console (operator or system console), MVS
may load a restartable disabled wait state '09X'.


Procedure 


To recover from wait state 09X, proceed as follows: 


1. At the system console, press the 'VIEW LOG' key and look for the priority message that shows the
system entering the disabled wait state. It will be of the form:

****************************************
*                                      *
* CPy has entered disabled wait.       *
* PSW = 000A0000 004x009n              *
*                                      *
* Intended Console: System             *
*                                      *
****************************************

CPy is the processor detecting the excessive spin condition, and is NOT the failing processor (the
processor with the problem).

2. Determine the failing processor from bits 40-47 of the disabled wait state PSW (refer to the priority
message shown above). The sixth byte of the PSW contains the logical id, in the form of '4x' (x = CP
number), of the processor causing the spin loop.

For example, if CP3 failed and wait state '092' was entered on another CPU, the PSW would look like:

000A0000 00430092

3. According to the installation recovery option, proceed with the 09X wait state recovery, as indicated
below. If operating in LPAR mode, before proceeding with the recovery actions, ensure that the
correct logical partition is targeted by entering the following command at the system console:

SETLP lpname
4. If your recovery option is:

a. SPIN, proceed as indicated in step 14
b. ABEND, proceed as indicated in steps 5 to 12, and 14
c. TERM, proceed as indicated in steps 5 to 12, and 14
d. ACR, proceed as indicated in steps 5 to 16

5. At the system console, enter:

'F ALTCP'

6. Select CPy (CPy being the CP indicated in the disabled wait state message; that is, NOT the failing
CP, but the detecting CP).

7. Press the STOP key.

8. Enter:

'A1 B3' (Alter Primary Virtual Storage)

9. Then enter:

'30E' at 'Address(hex) =>'

10. Move the cursor to location X'30E' and type one of the following restart codes:

'CC' (to initiate ABEND)
'BB' (to initiate TERM)
'AA' (to initiate ACR)

11. Press ENTER.

12. Press the START key.

13. Stop the failing processor by entering 'STOP CPx' on the system console.

14. Restart CPy by entering:

'RESTART CPy'

on the system console.

15. At this time, the failing CP (CPx) is taken offline by ACR, and normal console communication is
restored. The job that was active on CPx at the time is terminated with Abend 0F3.

The completion of ACR is indicated by one of the following messages:

IEA858E ACR COMPLETE CPU NOW OFFLINE
IEA858E ACR COMPLETE CPU NOW OFFLINE, PHYSICAL VARY FAILED

The text 'PHYSICAL VARY FAILED' indicates that the processor is logically offline, but still physically
online. This message may be issued if the processor was not taken offline because the service
processor was busy at the time the request was made. If you want to take the processor offline
physically, enter the 'CONFIG CPU(n),OFFLINE' command.

Note: In LPAR mode, it is not possible to configure the processor physically offline.

16. The processor taken offline during ACR processing should be configured back online. Enter:

'CF CPU(x),ONLINE'

Instruction Trace 


An instruction trace is best used to trace a disabled loop. Tracing enabled loops usually causes the
loop trace table to become filled with information not in the loop (such as instructions from tasks that
interrupt your looping program during the trace).

Note: You cannot initiate an instruction trace from the SYSCTL frame in LPAR mode.

You can record a loop by selecting the instruction trace option (D1) on the SYSCTL frame on the system
console. This facility records 982 instruction-counter values for each CP traced. After the loop trace option
has completed recording, the processors are left in the state they were in when instruction trace was
selected.

The recording function stops all processors and traces all online CPs, starting with the target CP. It is not
possible to trace only one processor, and once tracing has started, it cannot be interrupted until all CPs
have been traced.

The instruction tracing function takes about 90 seconds per CP. Therefore, on an ES/3090 Model 600E,
tracing disrupts system operation for approximately eight to nine minutes.

While instruction tracing is in progress, all consoles (both the system console and the MVS/SP V2 and V3
consoles) are locked out. (The MVS/SP V2 and V3 consoles are locked out because the CPs are stopped
during the tracing.)

The loop data is retrieved by subsequent SVC dumps or stand-alone dumps. If the tracing is done in PP
mode, retrieve the trace data through the DUMP command or a stand-alone dump before merging, or the
information may be lost.

After the instruction trace is completed, the operator can issue a console DUMP command to retrieve the
trace data. Both the instruction tracing and the dynamic dumping functions are non-destructive; that is,
they will not abnormally terminate the interrupted unit of work, and normal processing can continue
afterwards. If the loop appears to continue, it may be worthwhile to request a RESTART option 0 to terminate
the looping job. In case of disabled loops, reply ACR to message IEE331A. The trace data will be available
in the stand-alone dump, if the operator requests it, after the loop recording is complete.
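
Because the trace cannot be interrupted once started, it can help to estimate the outage before selecting option D1. The following Python sketch applies the figures quoted above (about 90 seconds and 982 instruction-counter values per CP); the function names are assumptions made for this example.

```python
def trace_outage_seconds(online_cps: int, seconds_per_cp: int = 90) -> int:
    """Approximate time all processors stay stopped while every online CP is traced."""
    if online_cps < 1:
        raise ValueError("online_cps must be at least 1")
    return online_cps * seconds_per_cp


def trace_values_recorded(online_cps: int, values_per_cp: int = 982) -> int:
    """Total number of instruction-counter values recorded across all online CPs."""
    if online_cps < 1:
        raise ValueError("online_cps must be at least 1")
    return online_cps * values_per_cp
```

For six online CPs the estimate is 540 seconds, or nine minutes, which is in line with the eight to nine minutes quoted for the Model 600E.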


Procedure 

To trace a loop use the following procedure: 

1. Select the SYSCTL frame (Figure 23) at the system console: 'F SYSCTL'.

2. Start loop tracing by selecting: 'D1' (Start Address Tracing).

During loop tracing, progress messages are written to the system console:

INSTRUCTION-ADDRESS TRACE IN PROGRESS. (62118)
INSTRUCTION-ADDRESS TRACE IS STARTED ON CP3. (62119)
INSTRUCTION-ADDRESS TRACE IS IN PROGRESS ON CP3. COUNT=100. (62120)
INSTRUCTION-ADDRESS TRACE ON CP3 IS COMPLETED. COUNT=982. (62121)
INSTRUCTION-ADDRESS TRACE IS STARTED ON CP4. (62119)

When instruction tracing is complete, the following message is written to the system console:

INSTRUCTION-ADDRESS TRACE IS COMPLETED FOR ALL CPS. (62122)


3. Save trace and storage by taking either an SVC dump or a stand-alone dump.

SCP MANUAL CONTROL (ESA/370 MODE)                          (SYSCTL)

A= INITIALIZE SYSTEM CONTROL PROGRAM       T= TARGET CP
   1. LOAD UNIT ADDR :                        X0. CP0   => 3. CP3
   2. LOAD PARM(A/N)                          X1. CP1      4. CP4
   3. INITIATE SCP INITIALIZATION             X2. CP2      5. CP5

B= INITIALIZE STANDALONE DUMP              R= RATE CONTROL
   AUTO STORE STATUS = ON                  -> 1. PROCESS
   1. LOAD UNIT ADDR                          2. I-STEP
   2. INITIATE STANDALONE DUMP

C= RESTART
   1. INITIATE RESTART

D= INSTRUCTION ADDRESS TRACE
-> 1. START ADDRESS TRACING

COMMAND ==> D1

26 JUL 88 19:42:40                                    PSW3 OPERATING


Figure 23. SYSCTL frame


7.0 Loss of MVS Consoles


The MVS console is one of the most critical devices in an installation. The loss of one or all consoles
attached to a system does not cause the system to fail; however, the inability of the operator to restore
the use of a console may lead to an unnecessary IPL.


Loss of Master Console


The master console function may be transferred to any other available physical console by one of the
following means:

• Automatically by MVS. When an I/O error is detected on the current master console, the function is
switched to the next console in the ring.

• As a result of the operator pressing the EXTERNAL INTERRUPT key on the system console when a
master console exists.

• As a result of the operator issuing the command:

VARY ddd,MSTCONS

Switching the master console function causes the following messages to be displayed on the new master
console:

IEE143I CONSOLE SWITCH REASON=reason
OLD=console NEW=console

IEE129I CONSOLE SWITCH, OLD=oldconsole
NEW=newconsole REASON=reason

Procedure 


To restore a console, proceed as follows:

1. A reason of IOER indicates that an I/O error occurred on the old console. A reason of EXT indicates
that the EXTERNAL INTERRUPT key was pressed. To determine the current state of the lost console,
enter the following command:

D U,,,nnn,1

where 'nnn' is the device number of the lost console.

2. Correct the error condition on the old console if the reason was IOER.

3. The master console function can be returned to the old console with the following commands:

a. VARY ddd,ONLINE (if not already online)

b. VARY ddd,CONSOLE

c. VARY ddd,MSTCONS

4. Messages IEE143I and IEE129I with REASON=VMST will be displayed on the new master console.


Loss of All Consoles


A 'no console' condition exists when MVS is unable to locate a console for communication with the
operator. This condition may occur as a result of a hardware failure on a device or control unit, or
inadvertent use of the EXTERNAL INTERRUPT key.

At IPL time, a 'no console' condition results in wait state 007. During normal operation, MVS continues
processing, and alerts the operator by sounding the alarm and displaying a priority message on the system
console. The message is hardware EC dependent and may not look exactly like the one shown below:

(64400)
******************** PRIORITY MESSAGE ********************
*
* Operator console not operational
*
* Intended Console: System
* Detailed Information: The SCP is unable to send messages to any
*                       I/O device specified as an operator
*                       console.
* System Action: The audible alarm is sounded and the system waits
*                for operator intervention.
* User Action: Identify the cause of trouble with the SCP
*
**********************************************************


The EXTERNAL INTERRUPT key can be used during this condition to allow a controlled master console
switch to a device that is:

• Online.
• Generated as a console.
• Not allocated (to TSO, for example) at the time the 'no console' condition occurred.
• Generated correctly and does not cause checks to occur when it is initially used.
Procedure 

To recover from the loss of all consoles, proceed as follows: 


1. Determine the problem with the consoles and correct it.


2. Press the ENTER key on the intended master console. 
3. Press the EXTERNAL INTERRUPT key on the system console. 
If operating in LPAR mode, before pressing the EXTERNAL INTERRUPT key, enter: 


SETLP lpname 


where ‘lpname’ is the logical partition name.

4. Press the ENTER key on the master console if necessary. Normal operation will be resumed at the
master console.


5. Restore the other consoles to the normal configuration. Use the following commands to bring all other 
consoles online: 


VARY ddd,ONLINE
VARY ddd,CONSOLE


6. Display the current status of all active consoles by using the command:

D C,A


7. Set up console specifications. 


Note: If you cannot recover from a ‘no consoles’ condition, a WTO buffer shortage eventually occurs. 
8.0 Disabled Console Communication Facility


The Disabled Console Communication Facility (DCCF) is a facility used by MVS to communicate with the
operator when a condition requiring immediate operator intervention is detected. MVS sends the message
to the operator console (MVS master or first alternate console). If no response is received within 125
seconds, the message is sent to the system console.


Note: Most messages are routed to the system console after timing out; however, one exception is
message:


IEA502I RESTART REASON COULD NOT BE OBTAINED FROM SYSTEM CONSOLE


If no response to message IEA502I is received after 125 seconds, restart reason 0 is assumed, and the 
current unit of work may be terminated with Abend 071. The system resumes normal processing. 
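
The routing described above can be summarized in a short sketch. It is only an illustration of the rules stated in this chapter, not of MVS internals: the function name and the `reply_delay` parameter are hypothetical, and only the 125-second limit and the IEA502I exception come from the text.

```python
from typing import Optional

# Timeout at the master or first alternate console, as stated in the text.
DCCF_TIMEOUT_SECONDS = 125


def route_dccf_message(msg_id: str, reply_delay: Optional[float]) -> str:
    """Describe where a DCCF message ends up (illustrative only).

    reply_delay is the hypothetical number of seconds the operator needs
    to answer at the master or first alternate console; None means that
    no reply is entered there at all.
    """
    if reply_delay is not None and reply_delay <= DCCF_TIMEOUT_SECONDS:
        return "reply accepted at master or first alternate console"
    if msg_id == "IEA502I":
        # Exception from the note above: the system assumes restart
        # reason 0 and resumes processing.
        return "restart reason 0 assumed"
    # All other messages move to the system console, where no timeout applies.
    return "message routed to system console (no timeout)"
```

For example, `route_dccf_message("IOS110A", None)` reports that the message is routed to the system console, while the same call for IEA502I reports that restart reason 0 is assumed.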


In most cases, when DCCF is running, nothing else can take place in the system (the system is disabled). 
Therefore, it is essential to resolve the DCCF condition as quickly as possible. The situations where DCCF
is used include the following:


• Hot I/O (refer to “Hot I/O” on page 101)
• Unconditional reserve recovery (DASD IFCC) on a device with a page data set (refer to “Page Data
Set Volume Error” on page 99)


Recognizing DCCF 
A DCCF situation is easy to recognize because: 


• A single console, either the master console or its first alternate, is cleared and a single message is
displayed at the top. At the bottom of the screen, a reply field is written in the form ‘R 0,’. An
example of a message is shown in Figure 24.


• In most DCCF situations, no other I/O activity is apparent in the system. (The other consoles do not
accept input.) The exception is during Restart DCCF processing, when the system is enabled and
communication is possible through other consoles.


If, for any reason, the master console or its first alternate is not available, or if, within 125 seconds, no reply 
is entered to a message that has been sent to the master or first alternate console, the message is routed 
to the system console. When the DCCF message is written to the system console, priority message 64400 
(Operator console not operational) is displayed and the audible alarm is sounded. Pressing the ENTER 
key on the system console restores the previous frame and the following message is displayed:


SCP messages are pending. Invoke or REFRESH the SCPMSG frame. (35201) 


To invoke the SCP Message Facility frame, the ‘F SCPMSG’ command must be issued. An example of the 
frame is shown in Figure 25. 
IEE127I THE FOLLOWING MESSAGE IS ISSUED THROUGH DISABLED CONSOLE FACILITY
IOS110A IOS HAS DETECTED HOT I/O ON DEVICE ddd (NON-DASD). THE LAST
INTERRUPT FROM THIS DEVICE WAS ON CHANNEL PATH xx. THE SCD 
IS AT aaaaaaaa. THERE ARE nn DEVICES WITH HOT I/O ON CHP xx. 


ENTER ONE OF THESE REPLIES TO TELL IOS HOW RECOVERY IS TO 
BE HANDLED: 


NONE THIS REPLY TELLS IOS THAT (1)THE OPERATOR DID NOT PHYSICALLY 
REMOVE ANY DEVICE OR CONTROL UNIT (HE MAY OR MAY NOT HAVE 
RESET THE DEVICE) AND (2) IOS SHOULD NOT REMOVE ANY DEVICE 
AND NOT ATTEMPT ANY CHANNEL RECOVERY. 


DEV THIS REPLY TELLS IOS TO LOGICALLY REMOVE (BOX) THE DEVICE. 
(THE OPERATOR MAY OR MAY NOT HAVE PHYSICALLY REMOVED THE DEVICE) 


CU THIS REPLY TELLS IOS THAT THE OPERATOR PHYSICALLY REMOVED THE
CONTROL UNIT. THE REPLY MUST INCLUDE THE NUMBER OF EACH DEVICE 
ON THE CONTROL UNIT. FOR EXAMPLE, IF DEVICES 25E, 250 THRU 257 
REPLY: CU,250:257,25E 
OR 
CU,25E,250:257


CHP,K THIS REPLY TELLS IOS (1) TO ATTEMPT RECOVERY FOR THE CHANNEL
PATH NAMED IN THE MESSAGE, AND (2) IF RECOVERY IS SUCCESSFUL, 
TO KEEP THE CHANNEL PATH ONLINE. 


CHP,F THIS REPLY TELLS IOS TO FORCE THE CHANNEL PATH OFFLINE. 


R 0, 


Figure 24. Message IOS110A


The cursor is positioned on line 23 when the frame is displayed. Sometimes, it is necessary to enter
ES/3090 hardware Service Language Commands on line 23 before responding to the message. For
example, if there was a need to STOP a CP prior to entering the response, you would issue the ‘STOP CPn’
Service Language Command on line 23 first. The response to the message must be made on line 21;
therefore, if there is no ES/3090 Service Language Command to enter, position the cursor on line 21 by
using the TAB key.


Note: There is no 125-second timeout for the reply to a message on the system console. The message 
stays pending until a reply is entered. 
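
The reply forms listed in Figure 24 (NONE, DEV, CU with a device list, CHP,K and CHP,F) can be checked before they are typed in. The following is a minimal sketch, not an IBM-supplied tool: the function names are hypothetical, and it assumes that device numbers are three hexadecimal digits and that ‘250:257’ denotes an inclusive range, as the example in the figure suggests. The text parsed is what follows ‘R 0,’ in the reply field.

```python
import re
from typing import Dict, List

# Assumption: device numbers are three hexadecimal digits, as in Figure 24.
_DEVICE = re.compile(r"[0-9A-F]{3}")


def expand_devices(item: str) -> List[str]:
    """Expand '25E' or an inclusive range such as '250:257'."""
    first, sep, last = item.partition(":")
    if not sep:
        last = first  # a single device number, not a range
    if not (_DEVICE.fullmatch(first) and _DEVICE.fullmatch(last)):
        raise ValueError(f"bad device number in {item!r}")
    low, high = int(first, 16), int(last, 16)
    if low > high:
        raise ValueError(f"range runs backwards in {item!r}")
    return [f"{n:03X}" for n in range(low, high + 1)]


def parse_hot_io_reply(reply: str) -> Dict[str, object]:
    """Validate the reply text entered after 'R 0,' (illustrative only)."""
    parts = [p.strip().upper() for p in reply.split(",")]
    keyword, args = parts[0], parts[1:]
    if keyword in ("NONE", "DEV") and not args:
        return {"action": keyword}
    if keyword == "CHP" and args in (["K"], ["F"]):
        return {"action": "CHP", "option": args[0]}
    if keyword == "CU" and args:
        devices: List[str] = []
        for item in args:
            devices.extend(expand_devices(item))
        return {"action": "CU", "devices": devices}
    raise ValueError(f"unrecognized reply: {reply!r}")
```

With the example from the figure, `parse_hot_io_reply("CU,250:257,25E")` lists devices 250 through 257 followed by 25E, and a bare `CU` with no device list is rejected.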
28 OCT 87 12:28:44
SCP MESSAGE FACILITY (SCPMSG)


IEE127I THE FOLLOWING MESSAGE IS ISSUED THROUGH DISABLED CONSOLE FACILITY
IOS110A IOS HAS DETECTED HOT I/O ON DEVICE ddd (NON-DASD). THE LAST
INTERRUPT FROM THIS DEVICE WAS ON CHANNEL PATH xx. THE SCD
IS AT aaaaaaaa. THERE ARE nn DEVICES WITH HOT I/O ON CHP xx.


ENTER ONE OF THESE REPLIES TO TELL IOS HOW RECOVERY IS TO BE HANDLED.


NONE -THIS REPLY TELLS IOS THAT (1)THE OPERATOR DID NOT PHYSICALLY
REMOVE ANY DEVICE OR CONTROL UNIT (HE MAY OR MAY NOT HAVE RESET
THE DEVICE) AND (2) IOS SHOULD NOT REMOVE ANY DEVICE AND NOT
ATTEMPT ANY CHANNEL RECOVERY.


DEV -THIS REPLY TELLS IOS TO LOGICALLY REMOVE (BOX) THE DEVICE.
(THE OPERATOR MAY OR MAY NOT HAVE PHYSICALLY REMOVED THE DEVICE.)


CU -THIS REPLY TELLS IOS THAT THE OPERATOR PHYSICALLY REMOVED THE
CONTROL UNIT. THE REPLY MUST INCLUDE THE NUMBER OF EACH DEVICE ON
IEE126I COMPLETE TEXT OF MESSAGE IOS110A CANNOT BE DISPLAYED


USE THE RESPONSE FIELD TO REPLY TO SCP MESSAGES. (35137)


COMMAND ==>
PSW3 OPERATING


Figure 25. SCPMSG Frame 


Responding to Messages Issued Through DCCF 


Handling DCCF requires special care because of the following conditions: 
• DCCF Lockout


Interrupts at other terminals on the same control unit as the master or alternate console can make it 
impossible to enter a response to the message from the master or its first alternate console. 


In DCCF mode, MVS accepts interrupts only from the console where the message has been written. 
If TSO users on screens attached to the same control unit present interrupts (by depressing the ENTER 
key several times), they lock out the operator’s response. This will result in a DCCF timeout. 


Be aware of this situation if your installation does not have the master console and its alternate at- 
tached to dedicated control units. 


To correct this situation, configure the MVS consoles as documented in IBM ES/3090 Complex
Systems Recovery and Availability Configuration Considerations, thereby preventing a DCCF lockout
condition in the first place.


• DCCF Timeout


If the message is not answered within 125 seconds after its appearance at the master or first alternate 
console, it is routed to the system console. There is no timeout for the reply to a message on the 
system console; the message remains pending until a reply is entered. 
• Restartable Wait States


A disabled restartable wait state is loaded only if the operator reply to the message cannot be pre- 
sented through the MVS master or alternate consoles, or the system console; that is, when all other 
attempts to communicate with the operator have failed. 


A response is still required from the operator, and recovery from the restartable wait state involves
the operator entering his response into an area of main storage. MVS retrieves the operator's
response when the CP is restarted from the wait state.


Refer to “Disabled Wait States” on page 7 for a description of disabled wait state recovery. 


Figure 26 gives a cross reference between the most common messages and the corresponding disabled 
wait state codes. 


Message     Restartable         Reason
            Wait State Code

IEE331A     091, 092, 095,      Software error (Spin Loop)
            096, 097, 099,
            09E

IOS110A                         Hot I/O non-DASD, non-DPS
IOS111A                         Hot I/O non-reserved DASD
                                or non-assigned DPS
IOS112A                         Hot I/O reserved DASD
                                or assigned DPS

IOS427A     06F                 Channel or CU problem on
                                string containing page data set

IOS062E     062                 Channel path recovery
IOS113W
IOS201E

IEA500A                         Restart Reason 0
IEA501A                         Restart Reason 1
IEA502I

IOS115A                         Page data set problem
IOS116A                         Missing Interrupt on page device
IOS109E     n/a


Figure 26. Messages Issued through DCCF

Cancel Action 


Often, after replying to the message from the master console or its first alternate, the following message 
is issued: 


IEE128A PERFORM THE CANCEL ACTION TO RESTORE THE NORMAL DISPLAY 
This can cause some confusion since most displays do not have a key marked ‘CANCEL’. The PA2 key
performs the CANCEL action, restores the screen, and clears the entry area. 
9.0 Device Boxing


Device boxing forces a device offline in such a way that current and future I/O operations are
terminated with a permanent I/O error and no new allocations to the device can be made. The device is
marked Pending Offline and goes offline as soon as the last allocation is freed. The UCB is
marked ‘boxed’.


Device boxing takes place: 

• Following a channel path error, if the last path to a device is lost.

• Following a channel path error, if a reserve on the device is lost.

• Following a VARY ddd,OFFLINE,FORCE command for the device.

• Following a ‘CF CHP(cc),OFFLINE,FORCE’ command if the last path to the device is removed.


• When the resource required to access the device becomes unavailable (for
example, taking a CHPID by using the system console).


• As a result of replying ‘BOX’ to message IOS427A during Unconditional Reserve recovery (refer to
“Unconditional Reserve” on page 97).


• If Unconditional Reserve recovery failed.


• When a hot I/O condition was detected on the device, and the recovery action selected (either by
default or by the operator) was ‘BOX’ (refer to “Hot I/O” on page 101).


• If the initialized state of a subchannel changes outside the control of MVS, and the device is still in
use by MVS or a user.


While a device is boxed: 


• No further I/O operations can be performed to the device. (Any I/O operation request fails with a
permanent I/O error.)


• No new allocation for the device is accepted.


• The device is marked PENDING OFFLINE and goes into offline status when the following conditions
occur, in the following order:


1. The device is no longer allocated to any job.
2. Allocation processing allocates any device in the system.


It is very important to understand that in the case of shared DASD devices, a boxed device is boxed only 
to the system that originated the boxing. The device is still accessible from other systems. This situation 
may lead to incorrect (or incomplete) data on the DASD volume. Such a situation must be reported to the 
owner of the data on the boxed DASD. 


A DASD device that was offline (either boxed or not) will have its volume serial number read by the vary
online operation. This information is placed into the UCB as part of the vary online process, provided
there are no out-of-line conditions such as a duplicate volume.
The following messages are examples of device boxing for DASDs:


IOS100I DEVICE ddd BOXED/FORCED OFFLINE, LAST PATH cc LOST

IOS102I DEVICE ddd BOXED/FORCED OFFLINE, OPERATOR REQUEST

IOS102I DEVICE ddd BOXED/FORCED OFFLINE, PERMANENT ERROR, RESERVE LOST

IOS152E DEVICE ddd BOXED BY SUBCHANNEL RECOVERY, DEVICE STATE UNKNOWN

IOS153E DEVICE ddd,BOXED STATE, NOW AVAILABLE FOR USE

IOS451I ddd BOXED, RESERVE LOST

IOS451I ddd BOXED, NO ONLINE OPERATIONAL PATHS

IOS451I ddd BOXED, DISBAND AND REGROUP OF PATH GROUP FAILED

IEF281I ddd NOW OFFLINE/DEVICE IS BOXED


Boxed devices may be varied online as follows: 


Procedure 


A device that is boxed and offline can be brought back online with the ‘VARY ddd,ONLINE’ command.


A device that is boxed but still online cannot be used. It can be made operational with the command
‘VARY ddd,ONLINE,UNCOND’.


This form of the command should only be used under the direction of your systems support personnel. 




To recover a boxed device, proceed as follows:


1. In most cases, the operator should make the boxed device offline to all sharing systems.

2. The cause for the boxing must be determined, and any required hardware repair actions taken.

In the case of a broken device, the device must be repaired before proceeding with step 3.


In the case of a broken control unit, the device should only be used over the other (good) control unit
paths. The broken control unit may be repaired at a later time. Proceed to step 3.


In the case of a broken channel, the device should only be used over other (good) channel paths. The
broken channel may be repaired at a later time. Proceed to step 3.


3. To bring the device online to allow your systems support personnel to verify the data on the boxed
device, proceed with one of the following:


a. If the device is offline and boxed (F-BOX), vary the device online using the following command: 


VARY ddd,ONLINE 


b. If the device is allocated and boxed (A-BOX), determine who is allocated to the device using the
following command:


D U,,ALLOC,ddd,1


Use your installation procedures to unallocate users of the device. (You may have to cancel jobs
or TSO users.)


If it is not possible to unallocate all users of the device (for example, a system task), then proceed
to step c. on page 67.


If necessary, use your installation’s deallocation procedure (for example, ‘S DEALLOC’), to cause
the device to go offline.


Vary the device online, using the following command:


VARY ddd,ONLINE
For a boxed allocated device, the above procedure is the preferred method for bringing the device
online, as it allows the device to be taken offline before it is brought back online. This causes 
MVS to perform VOLSER verification and path validation. 


Proceed to step 4. to verify the data on the volume. 


c. A device that is allocated and boxed, but not offline, may be brought online under the direction 
of your system support personnel, using the following form of the Vary command: 


VARY ddd,ONLINE,UNCOND


Note: When this form of the command is used to bring the device online, VOLSER verification 
is not performed. 


4. Verify or repair the data if necessary, or at least notify the owners of data on the volume. If a potential
data integrity problem exists, your systems support personnel must check the data before the device
is placed online to any system for starting productive work.


The following tools may (among others) be used to verify data:

• LIST VTOC for VTOC

• IDCAMS with DIAGNOSE option for VSAM catalogs

• IDCAMS with VERIFY option for VSAM data sets
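
The choice between the three ways of bringing a boxed device online (sub-steps a, b and c above) can be expressed as a small decision sketch. The function and parameter names are hypothetical and serve only to illustrate the order of preference given in the procedure; the commands themselves are those shown in the text.

```python
from typing import List


def boxed_device_steps(device: str, allocated: bool,
                       users_can_be_freed: bool = True) -> List[str]:
    """Return the operator actions for a boxed device (illustrative only)."""
    if not allocated:
        # Sub-step a: offline and boxed (F-BOX).
        return [f"VARY {device},ONLINE"]
    # Sub-step b: allocated and boxed (A-BOX); find out who holds the device.
    steps = [f"D U,,ALLOC,{device},1"]
    if users_can_be_freed:
        steps.append("unallocate users of the device (installation procedure)")
        steps.append(f"VARY {device},ONLINE")
    else:
        # Sub-step c: only under direction of systems support personnel;
        # VOLSER verification is not performed.
        steps.append(f"VARY {device},ONLINE,UNCOND")
    return steps
```

In every case, the data on the volume still has to be verified as described in step 4 before productive work is started.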
10.0 Missing Interrupts


Missing Interrupt Conditions 


A missing interrupt condition exists when an interrupt is expected, but fails to occur within a specified time. 
The time interval varies according to device type and installation specifications. 


The default time intervals are 15 seconds for DASD devices, 3 minutes for other device types (except MSS)
and 12 minutes for MSS devices. These values are specified in the IECIOSxx member of SYS1.PARMLIB.
Figure 27 shows the default specifications.


MIH DASD=00:15
MIH INTERVAL=03:00
MIH 3330V=12:00
MIH 3851=12:00
MIH DEV=none       DEV and TIME (as a group) can be used to
MIH TIME=none      alter the interval for selected device numbers.


Figure 27. Default Parameters for Missing Interrupt Detection 


MVS/ESA allows the display and setting of missing interrupt times with two commands: 


D IOS,MIH,TIME=ALL                - display all MIH times and settings
SETIOS MIH,DASD=00:15             - set DASD MIH time to 15 seconds
SETIOS MIH,DEV=2A0,TIME=03:00     - set MIH time to 3 minutes for 2A0
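
The mm:ss values used in Figure 27 and in the SETIOS examples can be converted to seconds when comparing intervals. The sketch below assumes that a device class without its own entry falls back to the INTERVAL value; the table keys mirror Figure 27, while the function names and the notion of a ‘device class’ string are hypothetical.

```python
# Default values copied from Figure 27 (format mm:ss).
DEFAULT_MIH = {
    "DASD": "00:15",
    "INTERVAL": "03:00",   # all other device types
    "3330V": "12:00",      # MSS
    "3851": "12:00",       # MSS
}


def to_seconds(mmss: str) -> int:
    """Convert an 'mm:ss' interval to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def default_mih_seconds(device_class: str) -> int:
    """Look up the default interval, falling back to INTERVAL."""
    return to_seconds(DEFAULT_MIH.get(device_class, DEFAULT_MIH["INTERVAL"]))
```

For instance, the lookup for "DASD" yields 15 seconds, and a class that is not listed yields the general interval of 180 seconds.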


The MVS Missing Interrupt Handler (MIH) notifies the operator when an expected interrupt fails to occur
within the specified time. Several conditions can lead to a missing interrupt:


• An outstanding mount for a tape or a DASD device.
• An I/O request that has been initiated by the software but has not completed in the I/O subsystem.


If an expected interrupt does not occur in the allotted time, the MIH initiates recovery actions and informs
the operator before system performance is severely impacted.


The operator is notified of an MIH situation through the messages defined in the following section. 


Missing Interrupt Handler Messages 


The messages in this section correspond to missing interrupt situations. The messages are described in 
detail and the description suggests the appropriate operator action. 


Handling Message IOS070E 


IOS070E ddd, MOUNT PENDING


MIH has detected a mount pending condition for device ddd. 
Procedure 


1. Use the D R,L command and mount the required volume.

2. Ready device ddd and issue the VARY ddd,ONLINE command.


Handling Message IOS071I

IOS071I ddd,cc,jjj,text


Where ddd  = Device Number
      cc   = Channel Path ID
      jjj  = Jobname
      text = One of the following


Missing Channel and Device End 


If the missing interrupt is occurring for a DASD device, the cause may be contention that is preventing
the device from reconnecting. Refer to “MIH Diagnostic Procedures” in Figure 32 on page 75.


Missing Device End 


Procedure 
1. Check the device for hardware error indications. 


2. If you just finished rewinding a tape or mounting a volume, issue a ‘VARY ddd,ONLINE’ to simulate
a device end.


Halt or Clear Subchannel Interrupt Missing 
Procedure 


1. These messages usually indicate hardware errors. For these errors, the MIH issues a ‘CLEAR
SUBCHANNEL’ instruction to the device. If the problem persists, try to force the device offline and
report the error to the service representative.


2. If the device is a 3851 MSS, this situation might not be a problem; therefore, no recovery action
is taken by the MIH.


3. Check SYS1.PARMLIB member IECIOSxx. Missing Halt Subchannel interrupts may be recognized if
the MIH time interval specified there is set too low.


Idle With Work Queued 


The error is usually software, but can be hardware. The system has work queued to the device, but 
the Channel Subsystem has no I/O request active for that device. 


Determine if any error recovery situations have recently occurred at the device(s) or CHPIDs to the
device.


The Missing Interrupt Handler resets the device and passes an I/O request to the channel. Refer to
“MIH Diagnostic Procedures” in Figure 29 on page 72.


Start Pending 
Procedure 


1. If message IOS071I is followed by message IOS452I (ddd,xx) OPERATIONAL PATH ADDED TO
PATH GROUP, refer to “MIH Diagnostic Procedures” in Figure 30 on page 73.


2. First, try to understand the reason for the message. In a shared
system environment the ‘START PENDING’ message does not necessarily indicate an error
condition; it could mean one system owns a DASD actuator exclusively and the other system is trying
to access the same disk. If the message is repeated many times, there may be a problem and
operator intervention might be needed.

3. For repeated ‘START PENDING’ messages, first determine whether the device is shared with one
or more other systems. If it is not shared, there is probably a hardware error on the device. If
the device is shared, there might be a contention problem caused by one of the sharing systems.
Refer to Figure 28 on page 72.


Handling Message IOS075E 


IOS075E ddd RECURRING MIH CONDITION FOR THIS DEVICE


MIH processing is trying to recover a previous missing interrupt situation by issuing a CLEAR SUBCHAN- 
NEL instruction to reset the device, but due to hardware problems, the device or the condition was not 
reset. 


If you verify that this condition is not a result of a shared systems contention as described in Figure 31 on
page 74, vary the device offline as follows:


VARY ddd,OFFLINE,FORCE 


Handling Message IOS076E 


IOS076E ddd,pp,jjj,text


The format of this message is identical to that of message IOS071I, and it is issued in the following
situations:


• A clear subchannel interrupt is missing.
• The MIH exit routine for the 3851 indicated that the device is not to be reset.

If the device is not a 3851 (MSS), vary the device offline using the FORCE operand.


If it is a 3851, check the device. If it has an unrecoverable problem, and a backup MSC is available, use 
the clear switch. If there is no error indication, do nothing and wait for the operation to complete. 


Handling Message IOS077E 


IOS077E ddd,pp,jjj,text


This message is similar to message IOS071I and indicates a recurring situation. Normally, it accompanies
message IOS075E.
Types of Missing Interrupt     Actions         Corrections

Mount Pending                  D R,L           Mount required VOL
                               D U,,,ddd,1     V ddd,online
Idle with Work Queued          D U,,,ddd,1     Refer to Figure 29
MIH and DPS Out of Sync                        Refer to Figure 30
Start Pending                                  Refer to Figure 31
Missing CE/DE                                  Refer to Figure 32
Missing DE (only) non DASD                     Refer to text
Missing HSCH                                   Notify hardware support
Missing CSCH                                   Notify hardware support

Figure 28. MIH Diagnostic Procedures DASD - General
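
Where console logs are scanned by a program, the IOS071I format (ddd,cc,jjj,text) and the cross-reference in Figure 28 lend themselves to a simple lookup. The sketch below is illustrative only: the function names are hypothetical, and the exact wording of the condition text in real messages may differ from the keys assumed here.

```python
from typing import Dict

# Condition text -> guidance, following Figure 28. The key spellings are
# assumptions based on the condition names used in this chapter.
GUIDANCE = {
    "IDLE WITH WORK QUEUED": "Refer to Figure 29",
    "START PENDING": "Refer to Figure 31",
    "MISSING CHANNEL AND DEVICE END": "Refer to Figure 32",
    "MISSING DEVICE END": "Refer to text",
}


def parse_ios071i(line: str) -> Dict[str, str]:
    """Split 'IOS071I ddd,cc,jjj,text' into its four fields."""
    msg_id, _, rest = line.strip().partition(" ")
    if msg_id != "IOS071I":
        raise ValueError(f"not an IOS071I message: {line!r}")
    # Split on the first three commas only; the text may contain more.
    device, chpid, job, text = [f.strip() for f in rest.split(",", 3)]
    return {"device": device, "chpid": chpid, "job": job, "text": text}


def guidance_for(line: str) -> str:
    """Return the Figure 28 cross-reference for a message line."""
    text = parse_ios071i(line)["text"].upper()
    return GUIDANCE.get(text, "Notify hardware support")
```

A line such as `IOS071I 2A0,1C,PAYROLL,START PENDING` (the device, CHPID and job name are made up) is mapped to the Start Pending procedure in Figure 31.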


Symptom 


Idle With Work Queued 


— Reported detection: 


— Causes 


Software Probably MVS software problem. 
May occur after VM bounce. 


—- Actions 
MVS console 
MVS Commands: 
DUMP COMM=(operator dump name)
R n,ASID=(a),SDATA=(NUC,SQA,TRT)


MVS console — System console logs. 


EREP Event Report. 


Figure 29. MIH Diagnostic Procedures DASD - Idle With Work Queued 
Symptom

  MIH and DPS out of sync

- Reported detection

  The following messages, when reported one after
  another for the same device, are considered to be a
  DPS out of sync condition.

  IOS071I ddd,yy,xxxxxxx,START PENDING
  IOS452I (ddd,xx) OPERATIONAL PATH ADDED TO PATH GROUP

- Causes

  Channel              Incorrect use of CHNCFA frame.
                       Channels should be configured
                       off/on using MVS CF command
                       when MVS is active.

  Control Unit         Wrong use of CU enable/disable
                       switches - MVS VARY PATH
                       commands should be used first.
                       Control Unit IML.

  3814 switch          Switching of DASD interface on
                       3814 prior to use of MVS
                       VARY PATH command.

  3814 switch          Switching of control unit remote
                       disable/enable switch function
                       on 3814 prior to use of MVS
                       "VARY PATH" command.

  Control Unit         Broken DPS array.

- Actions

  MVS console          Save the MVS console log for
                       the hardware CSR.

  EREP                 Print out an EREP report for:
                       Events
                       DASD OBR
                       DASD Unsupported records.

- Corrections

  D M=CHP(cc)          Determine control unit device
                       address range on reported DPS
                       out of sync path.

  D M=DEV(ddd)         Determine all paths to
                       device that had DPS out of sync.

  V PATH(ddd-ddd,cc),online
                       See on which "path" (CHPID)
                       other devices report:
                       "Operational path added to
                       path group".

  D M=CHP(cc)          Determine all devices
                       configured online to the DPS
                       out of sync path.

  V PATH(ddd-ddd,cc),online
                       Vary path online for all DPS
                       devices shown online to the
                       DPS out of sync CHPID.

                       Note. Avoid varying online
                       paths that may already be
                       offline; they may be offline
                       for other reasons.

  Check other sharing systems (for case of IML or
  interface switch cause).


Figure 30. MIH Diagnostic Procedures DASD - DPS Out of Sync
Symptom

  Start Pending

  Note. If the following message also appears:
  IOS452I (ddd,cc) OPERATIONAL PATH ADDED TO PATH GROUP
  refer to Figure 30.

- Reported detection

  IOS071I ddd,yy,xxxxxxx,START PENDING
  IOS071I ddd,**,*MASTER*,START PENDING

- Causes

  Reserve              Lock out from another system.
  Contention           Device - Port - Cont - CU - Ch,
                       contention from other devices.
                       See Figure 34 on page 77.
  Outstanding          Recovery on other systems in
  Recovery             progress, not complete or
                       elected not to be done.
  IOCP                 Split Control Unit definition.
  HW Limitation        Device - Control - Interface.

- Actions

  D M=CONFIG(xx)       Check for correct configuration
                       on MIH detecting system.
  DS P,ddd,16          DEVSERV Paths command.
  D U,,,dd0,16|32      For ALL systems sharing
                       DASD observe - R - P - BSY status.
                       Observe range of devices to check
                       for interaction/contention.
  D U,,ALLOC,ddd,1     On reserving (R) or busy (BSY)
                       devices, find JOB user/s.
  D R,L                On other systems, look for
                       outstanding recovery (i.e.
                       previous message IOS427A).
  D GRS,C              Find reserving JOB, using
                       GRS ring or OEM equivalent.
  RMFMON               SENQR (or PF9) on all systems
                       to find the reserving JOB.
                       Observe dynamic display
                       when no contention.
  OEM S/W Products
  IOCP list

- Corrections

  Vary online          Return missing resources to
                       configuration.
  Cancel Job/user      For reserve or contention -
                       check with schedule.
  Perform Recovery     For outstanding messages.
  Check RMF reports    For reserve or contention -
                       check 'pending or disconnect'
                       times. (System Programmer).
  Change schedule      Change JOB schedule.
  Check Data set       Check dataset placement.
  IOCP                 Correct CU Macro.
  GRS                  RNL setup (resource name list).


Figure 31. MIH Diagnostic Procedures DASD - Start Pending




Symptom

  Missing CE/DE

- Reported detection

- Causes

  Contention           Device - Port - Cont - CU - Ch,
                       contention from other devices.
                       See Figure 34.

  Incomplete           Check other systems.
  recovery

  HW Limitation        Device - Control - Interface.

- Actions

  Vary online          Return missing resources to
  resources            configuration.

  V dd0-ddd,ONLINE     Vary on range of devices which
                       may be affected by DPS array
                       out of sync.

  D M=CONFIG(xx)       Check for correct configuration.

  DS P,ddd,16          DEVSERV Paths command.

  D U,,,dd0,16|32      For all systems sharing DASD
                       observe - BSY status.

  D U,,ALLOC,ddd,1     On busy (BSY) devices,
                       find JOB user/s.

- Corrections

  Cancel Job/user      For contention -
                       check with schedule.

  Check RMF reports    For contention -
                       check 'disconnect' times.

  Change schedule      Change JOB schedule.


Figure 32. MIH Diagnostic Procedures DASD - Missing CE/DE
Missing Interrupt Handler Recovery Actions


When a missing interrupt condition has been detected:


o STSCH to determine the progress of the Channel Program
o HSCH to stop the operation currently in progress
o STSCH to determine the current state of the Channel Program
o CSCH to clear indicators in the UCW
o Issue one of the following messages:
     IOS071 START PENDING
     IOS071 MISSING CHANNEL END AND DEVICE END
     IOS071 IDLE WITH WORK QUEUED
o Write MIH EREP record
o Call DPS Validation
     SNID to determine the state of the DPS array
     SPID to correct the array
     Issue one of the following messages:
        IEA722 OPERATIONAL PATH ADDED TO PATH GROUP
        IEA722 NOT OPERATIONAL PATH TAKEN OFFLINE
      or
        IOS452 OPERATIONAL PATH ADDED TO PATH GROUP
        IOS452 NOT OPERATIONAL PATH TAKEN OFFLINE
     Write EREP record for "out of sync" condition
o Redrive original I/O request


Figure 33. MIH Recovery Actions (DPS DASD) 
3380 Model AA4 Port Structure


[Diagram: Channels, System Interface, 3880 (SD1, SD2), Cont 0, Cont 1, 3380-AA4]


Devices 0,1,8,9 share the same port 


Devices 4,5,C,D share the same port 


Devices 2,3,A,B share the same port 


Devices 6,7,E,F share the same port 


Figure 34. 3380 Model AA4 Port Structure 
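
The port groups listed in Figure 34 can be used to check whether two devices on the same string may contend for a port. The sketch assumes that the device numbers in the figure correspond to the low-order hexadecimal digit of the device address; that mapping, and the function name, are assumptions made for illustration.

```python
# Port groups as listed in Figure 34 (low-order hex digit of the device).
PORT_GROUPS = [set("0189"), set("45CD"), set("23AB"), set("67EF")]


def shares_port(dev_a: str, dev_b: str) -> bool:
    """True if both devices fall in the same Figure 34 port group.

    Assumes both devices are on the same 3380 string, so that only the
    last hexadecimal digit of each device address matters.
    """
    a, b = dev_a[-1].upper(), dev_b[-1].upper()
    return any(a in group and b in group for group in PORT_GROUPS)
```

Under these assumptions, devices 2A0 and 2A9 share a port, whereas 2A0 and 2A4 do not.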
11.0 Dynamic Pathing


Dynamic pathing devices are a class of I/O devices (currently 3380s and 3480s) that contain information in
their controllers and storage directors about the systems to which they are attached. The information, which
describes the paths leading to the devices from the various connected systems, is set up during the IPL
process of any system connected to the device and can be changed by issuing MVS commands (such as
VARY and CONFIG). The paths coming from the same system to the device together make up a path
group.


When a 3380 device is reserved to a system or a 3480 device is assigned to a system, there is an indication 
in the DPS Array information. 


Handling Dynamic Pathing Devices 


You may have a need to change the hardware configuration of DPS devices. This change can require re- 
moval or addition of paths by switching through switching units such as a 3814, enabling or disabling 
control unit interface switches, and so on. Improper handling of these situations can cause the control 
information stored in the DPS Array to become invalid or out of sync. This situation may also lead to losing 
a Reserve (3380) or an Assignment (3480). 


In general, observe the following rules when changing the configuration of DPS devices: 
• When removing a channel path to the devices you should:


1. Take all the channel paths offline logically by varying the path offline to all the devices dependent
upon it.


Use the MVS command(s):
VARY PATH(ddd-ddd,cc),OFFLINE


2. When the path is offline, remove the path physically by switching it at the 3814 or disabling the
interface switches.


• When adding a path to DPS devices, use the reverse procedure; namely:
1. Enable the path to the device physically.
2. Use MVS commands to vary the path online logically.
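
The essential point of the rules above is the order of the logical and physical steps. A minimal sketch that generates the steps in the correct order is shown below; the function name and its parameters are hypothetical, and the physical steps are given as reminders in plain text.

```python
from typing import List


def path_change_plan(action: str, first_dev: str, last_dev: str,
                     chpid: str) -> List[str]:
    """Return the ordered steps for removing or adding a DPS path."""
    vary = f"VARY PATH({first_dev}-{last_dev},{chpid})"
    if action == "remove":
        # Logical removal first, then the physical change.
        return [f"{vary},OFFLINE",
                "remove the path physically (3814 or interface switches)"]
    if action == "add":
        # Reverse order: physical enable first, then logical vary online.
        return ["enable the path to the device physically",
                f"{vary},ONLINE"]
    raise ValueError(f"unknown action: {action!r}")
```

Calling `path_change_plan("remove", "250", "257", "1C")` (sample device range and CHPID) puts the VARY PATH command ahead of the physical switch, and the "add" case reverses the order.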


If possible, never enable or disable control unit interfaces, re-IML control units, or power down control
units or DPS devices without first issuing the correct MVS command to remove the devices and their paths
from the system. If one of the above conditions does, nevertheless, happen, you must ensure that the DPS
arrays are re-synchronized before continuing to use those devices. Issuing a ‘Vary Path’ command to a
device or range of devices on a path will cause ‘DPS Validation’ to be invoked on the device or range of
devices.


DPS Validation automatically rebuilds the DPS arrays on DPS devices (DASD and tape) that it
determines are out-of-sync. The exception is when DPS Validation determines that a Reserve to a DASD
device may have been lost or, in the case of tapes, that an Assignment has been lost; in either case, DPS
Validation will ‘Box’ the device. Refer to “Device Boxing” on page 65.
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Recognizing DPS Array Out-of-Sync 


If the DPS array for 3380/3480 devices becomes invalid (‘out of sync’), MVS (when notified) attempts to 
rebuild the DPS array information automatically by using DPS Validation. 


The following message indicates that MVS has detected a possible out-of-sync condition, and therefore, 
MVS will invoke DPS Validation. 


IOS071I DEVICE ddd .... START PENDING


The following message indicates successful DPS Validation. 


IOS203I CHANNEL PATH xx SUCCESSFULLY RECOVERED


The following message indicates recovery of a DPS out of sync condition: 


IOS452I ddd,cc OPERATIONAL PATH ADDED TO PATH GROUP


Proceed to recover other device paths on this CHPID. 


If the DPS Validation routine detects an error in the path during its process, the path is taken offline. One 
of the following messages will be generated: 


IOS001E ddd INOPERATIVE PATH cc
IOS450E ddd,cc NOT OPERATIONAL PATH TAKEN OFFLINE 
IOS450E ddd,cc PERMANENT I/O, PATH TAKEN OFFLINE 


In some situations, the device may be boxed because: 


• There is no available path to the device. The following message is generated:

  IOS451I ddd BOXED, NO ONLINE OPERATIONAL PATHS

• A reserve (or assign) that was indicated in the DPS Array has been lost. One of the following
  messages is generated:

  IOS451I ddd BOXED, RESERVE LOST
  IOS451I ddd BOXED, ASSIGN LOST

• The process of removing (disbanding) the path group for the device and rebuilding (regrouping) it
  failed. In this case, no more I/O operations can be performed to the device. The following message
  is displayed:

  IOS451I ddd BOXED, DISBAND AND REGROUP OF PATH GROUP FAILED
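As an aid to recognizing these messages in a console log, the following sketch maps each message described in this section to the meaning given in the text. It is a minimal illustration only; the function name and the wording of the returned strings are assumptions, and the patterns cover only the messages listed above.

```python
import re

# Each pattern is paired with the meaning this section gives for the message.
RULES = [
    (r"^IOS071I\b.*START PENDING",
     "possible out-of-sync condition; DPS Validation will be invoked"),
    (r"^IOS203I\b", "DPS Validation successful"),
    (r"^IOS452I\b", "out-of-sync condition recovered; path added to path group"),
    (r"^IOS001E\b", "path error during DPS Validation; path taken offline"),
    (r"^IOS450E\b", "path error during DPS Validation; path taken offline"),
    (r"^IOS451I\b.*BOXED, NO ONLINE OPERATIONAL PATHS",
     "device boxed: no available path"),
    (r"^IOS451I\b.*BOXED, (RESERVE|ASSIGN) LOST",
     "device boxed: reserve or assign lost"),
    (r"^IOS451I\b.*DISBAND AND REGROUP",
     "device boxed: path group could not be rebuilt"),
]


def classify(line):
    """Return the meaning of a console message line, or None if it is not listed."""
    text = line.strip()
    for pattern, meaning in RULES:
        if re.search(pattern, text):
            return meaning
    return None


if __name__ == "__main__":
    print(classify("IOS451I 1A0 BOXED, RESERVE LOST"))
```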


Procedure

If a path went offline or a device was boxed during the DPS Validation process, check that:

• The 3880/3480 was IMLed correctly.
• The 3990/3880/3480 has the interface(s) enabled.
• The device was correctly switched.


If any one of these checks fails, fix the problem, and recover with the VARY PATH/DEVICE
ONLINE commands.


In other cases, the error is probably a hardware error.


Recovering a 3990/3380 Out of Sync Condition


When the 3990 experiences a system-resetting event, its DPS arrays may be invalid. The 3990 notifies the 
operating system using Reset Notification on the next I/O operation down the affected path. MVS then in- 
itiates channel path recovery that includes DPS Validation to rebuild the arrays. The following message 
is displayed to indicate successful recovery, without deviation from what the Path Available Mask indi- 
cated. 


IOS203I CHANNEL PATH xx SUCCESSFULLY RECOVERED


Procedure 

To recover from a 3990/3380 out of sync condition, proceed as follows: 

1. When DPS Validation finds a deviation or errors, the corresponding messages are displayed.

2. Use ‘D M=CHP(xx)’ to determine the affected range.

3. Use ‘DS P,ddd,n’ to display the device configuration status.

4. Find the reason for reset notification. This might be a disabled interface, a 3814 switching problem,
   a CHPID being reconfigured using the system console, or a hardware error, among other causes.

5. Run EREP and contact the hardware service representative if required.

6. When the fault is corrected, vary the affected paths online with the ‘VARY PATH(ddd,cc),ONLINE’
   command.


Recovering a 3880/3380 Out of Sync Condition 


DPS Validation for a Start Pending condition validates only the devices for which IOS has had a missing
interrupt. If the start pending is due to an array out-of-sync condition, other devices are most likely
affected, since the cause is typically related to an action on a 3880 SD or channel path.


Depending on the device activity the system could take several minutes to several hours to validate and 
notify the operator of the problem. 


While the arrays are out of sync, performance is impacted because of the loss of dynamic path reconnect. 
In addition the installation is exposed to a failure or operator action on another path that might then result 
in undetected loss of reserves (data integrity exposure) or boxing of the device. 


It is therefore recommended that whenever a DPS array out-of-sync condition is detected for one device
(as indicated by message IOS452I), the operator should initiate DPS Validation for the other devices.
Refer to Figure 30 on page 73.


Recovering a 3480 Out of Sync Condition 


The 3480 DPS can be inadvertently reset by:

• Disabling/enabling interfaces
• Incorrect device switching
• Power off/on and IML of the control unit

Note: Certain 3480 hardware errors cause the Control Unit to be IML’d automatically.

If the Channel Subsystem selects a drive using the reset interface, the 3480 presents a unit check. The
following message could be displayed:


IOS000I ddd,cc,ASE,DB,0200,,**, label, jobname
01485045000000200040( 33E4000000000000)0002( 00000000)


The code ‘ASE’ (assigned elsewhere) is issued as a result of the 3480 sense bytes.


Procedure

1. Use the DEVSERV command to display the subsystem status: ‘DS P,ddd,8|16’

2. Correct the fault.

3. Use the ‘VARY PATH(ddd-ddd,cc),ONLINE’ command to call DPS Validation.

4. The following messages should be displayed:

   IEE302I PATH(ddd,cc) ONLINE
   IOS452I (ddd,cc) OPERATIONAL PATH ADDED TO PATH GROUP

5. Use the DEVSERV PATH command to display the status.


DPS Device Messages - Problem Cross Reference 


The Cause Codes in Figure 35 provide an example of how cross-reference tables may be used to diagnose 
problems with 3380/3880 DPS devices. 


Figure 35 describes some common problems encountered when operating 3380/3880 devices as well as
sample procedures to handle them.


Figure 36 and Figure 37 provide a cross-reference of these problems and their related messages. They 
are intended to assist the operator in analyzing the console messages and in determining the possible 
failing components. 


These tables do not describe all possible problem and message situations. Customers wishing to use this
approach should expand and customize this section based upon their own hardware/software environment
and recovery procedures.


Method of use 


Compare the messages in Figure 36 and Figure 37 with the console messages. For each message, note 
the possible cause code, and then choose the ‘best fit’. 


Consider the following example. You receive the following messages:


IOS001E ddd,INOPERATIVE PATH(s) pp
IEA466I PERMANENT IO ERROR, FAULT CODE=xxxx
IEA469E PATH(ddd,pp) HAS BEEN VARIED OFFLINE
IOS428I ddd,pp, HAS BEEN RECOVERED THROUGH CHANNEL PATH zz
IOS450E ddd,pp NOT OPERATIONAL PATH TAKEN OFFLINE
IOS444I DYNAMIC PATHING NOT REMOVED FROM DEVICE ddd/FROM PATH(ddd,pp)


If you look in the tables (Figure 36 and Figure 37), you will find:


IOS001E matches cause code 1
IEA466I matches cause code 4
IEA469E matches cause codes 1, 3
IOS428I matches cause code 1
IOS450E matches cause codes 1, 3
IOS444I matches cause codes 1, 2, 3, 4, 5, 6, 8




From this, the conclusion is that the probable cause is cause 1 (Figure 35): an SD error with more than
one path available.
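One way to picture the ‘best fit’ step is as a simple tally: each message received votes for every cause code it matches, and the code with the most votes is the probable cause. The sketch below uses only the six message-to-cause pairs quoted in the example above; the function and table names are illustrative, and an installation would substitute its own full cross-reference table.

```python
from collections import Counter

# Message-to-cause-code pairs taken from the worked example in this section.
CAUSES_BY_MESSAGE = {
    "IOS001E": {1},
    "IEA466I": {4},
    "IEA469E": {1, 3},
    "IOS428I": {1},
    "IOS450E": {1, 3},
    "IOS444I": {1, 2, 3, 4, 5, 6, 8},
}


def best_fit(message_ids, table=CAUSES_BY_MESSAGE):
    """Return the cause code(s) matched by the largest number of messages."""
    votes = Counter()
    for msg in message_ids:
        # Messages missing from the table contribute no votes.
        votes.update(table.get(msg, ()))
    if not votes:
        return []
    top = max(votes.values())
    # More than one code is returned when there is a tie.
    return sorted(code for code, count in votes.items() if count == top)


if __name__ == "__main__":
    received = ["IOS001E", "IEA466I", "IEA469E", "IOS428I", "IOS450E", "IOS444I"]
    print(best_fit(received))
```

With the six messages of the example, cause code 1 is matched by five messages, more than any other code, which agrees with the conclusion drawn in the text.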


Cause  Description      Operator action(s)
Code

 1     SD error         Isolate failing SD for CE repair. May cause
       > 1 path         performance degradation. If only one remaining
                        path, transfer critical applications to backup.

 2     SD error         Isolate failing SD for CE repair.
       last path        Identify and recover failing tasks.

 3     CU error         Isolate failing CU for CE repair. May cause
       > 1 path         performance degradation. If only one remaining
                        path, transfer critical applications to backup.

 4     CU error         Isolate failing CU for CE repair.
       last path        Identify and recover failing tasks.

 5     CHP error        Isolate failing CHP for CE repair. May cause
       > 1 path         performance degradation. If only one remaining
                        path, transfer critical applications to backup.

 6     CHP error        Isolate failing CHP for CE repair.
       last path        Identify and recover failing tasks.

 7     CF CHP OFF       Check status of alternate paths, and vary
       last path        online any paths that should be online.
                        Otherwise, defer CF until alternate available.

 8     CF CHP OFF,      Check status of alternate paths, and vary online
       FORCE            any paths that should be online. Defer CF if
                        possible; otherwise, recover any failing tasks.

 9     CU BUSY with     Identify failing CU and system responsible.
       VARY or CF       Consult local procedures for dealing with
                        hung control unit.

10     DPS Array        Determine range of affected devices.
       Out-of-Sync      Re-synchronize DPS Array information for all
                        affected devices.

Figure 35. Cause Codes


MESSAGE   CAUSE CODE
          (The cause code columns of this figure are not legible in this copy.)

IEA442E
IEA447E
IEA466I
IEA469E
IEE097I
IEE100E
IEE131D
IEE133I
IEE507D
IEE541I
IEE717D
IEE756I
ILR009E
IOS000I
IOS001E
IOS050I
IOS062E

NOTE 1: Issued only in response to the CONFIG command.

Figure 36. Message/Cause Cross Reference (Part 1 of 2)


MESSAGE   CAUSE CODE
          (The cause code columns of this figure are not legible in this copy.)

IOS071I
IOS100I
IOS102I
IOS104I
IOS105I
IOS115A
IOS162A
IOS201E
IOS202I
IOS203I
IOS251I
IOS427A
IOS428I
IOS429I
IOS444I
IOS450E
IOS451I
IOS452I

NOTE 2: Issued only when the device contains a PAGE data set.

Figure 37. Message/Cause Cross Reference (Part 2 of 2)


12.0 Write Inhibit


Recognizing a Write Inhibit Condition 


MVS/ESA DASD Error Recovery Procedures (ERP) may fence a failing component on a path to prevent 
data corruption by invoking the write inhibit facility of the 3880 and 3990 storage director. 


When a DASD I/O error occurs, the DASD ERP determines from the sense bytes which component is
causing the error, and may write inhibit at one of the following levels:


1. Channel interface 
2. Storage director 
3. Controller 


DASD ERP attempts to recover the failing I/O operation over an alternate path, if one exists. If the recovery
is successful, the failing path is automatically varied offline.


Messages Issued During Write Inhibit Processing


The following messages indicate that a Write Inhibit condition exists:

IEA467E PATH (ddd,cc) WRITE INHIBITED (type) FOR ALL WRITE OPERATIONS
IEA468E WRITE INHIBITED PATH (ddd,cc) ENCOUNTERED
IEA469E PATH (ddd,cc) HAS BEEN VARIED OFFLINE
IEA469E PATH (ddd,cc) CANNOT BE TAKEN OFFLINE


Handling Write Inhibit Condition 


When a condition requiring Write Inhibit is detected, the following message is issued to the MVS operator 
console: 


IEA467E PATH (ddd,cc) WRITE INHIBITED (type) FOR ALL WRITE OPERATIONS 


The ‘type’ field in the message identifies the component for which the Write Inhibit condition has been 
established in the storage director. Three elements may be Write Inhibited: 


• Channel Interface

  - Any write through the channel interface of that storage director is inhibited. Other control units
    on the same channel interface are not affected.
  - Only one interface to one system is affected.
  - Write operations to the affected devices can be done over another path to this storage director
    or through an alternate storage director, if available.

• Storage Director

  - Any write through that storage director is inhibited. All systems using that storage director are
    affected.
  - Write operations to the devices connected to that storage director can be executed through an
    alternate storage director, if available.

• DASD Controller

  - Any write through that controller is inhibited.
  - All systems connected through the controller are affected.
  - Write operations to the devices connected to that controller can be executed through the other
    controller of the head of string.


Any attempt to write through the inhibited element is denied and the following message is issued to the 
MVS operator console: 


IEA468E WRITE INHIBITED PATH (ddd,cc) ENCOUNTERED 


The message that follows depends on whether there are alternate paths available for the device, or not: 


• If an alternate path is available, and if the write is successful through that alternate path, the failing
  path is taken offline and the following message is issued:

  IEA469E PATH (ddd,cc) HAS BEEN VARIED OFFLINE

• If no alternate path is available, the path is not taken offline, and the following message is issued:

  IEA469E PATH (ddd,cc) CANNOT BE TAKEN OFFLINE


Procedure

Handle a Write Inhibit condition as follows:

1. If the error persists, issue:

   VARY PATH(ddd,cc),OFFLINE,UNCOND

   to vary the path offline. This assumes that the path is online but NOT allocated; otherwise, the
   command fails.

2. Report a Write Inhibit condition to the service representative and have the failing component fixed
   before attempting to use it.


Removing a Write Inhibit Condition 


Report a Write Inhibit condition to the service representative and have the failing component fixed before
attempting to use it.


After the failing component has been repaired, the Write Inhibit condition can be removed in one of two 
ways: 


1. Entering the ICKDSF “CONTROL ALLOWWRITE” Command. 


If this is not already available as a standard procedure in your installation, ask the system program- 
mer to prepare a job to run the ICKDSF program (Release 7 or higher), using the CONTROL 
ALLOWWRITE command to re-enable the affected storage director(s) for write operations. 


This is the normal procedure by which a storage director is Write Allowed after repair. 


2. IMLing the Storage Director.

   The storage director may already have been IMLed by the CE as part of the repair action. Be careful:
   IMLing an active SD can lead to data integrity problems, so be sure that all paths from all sharing
   systems are offline to the storage director you intend to re-IML.


After successfully completing either of the above actions, vary online all the paths to all the devices on the 
write-inhibited DASD subsystem. Use the DEVSERV command to verify that all the paths are (logically) 
online. 


You should perform the above procedure on all systems in the installation that are connected to the 
write-inhibited DASD subsystem. 


Because CONTROL ALLOWWRITE operates on a device basis, if more than one storage director to the 
device is failing, all storage directors must be repaired or tested before the CONTROL ALLOWWRITE 
command is issued. 


ICKDSF sends the CONTROL ALLOWWRITE command down every path to the selected device, whether the
path is marked offline or not. The exception is a reserved device, for which only the online paths are used.
These are most probably not the paths that have to be reset, so in that case the command is not effective
in removing the write inhibit condition.


To overcome this problem, choose several devices and run ICKDSF against each of them. The probability
of finding all of them reserved is very low.


The sample JCL skeleton below causes the 3880 Storage Director attached to the 3380 volume with volume
serial number vvvvvv to have its write inhibited indication removed:


//jobname  JOB
//stepname EXEC PGM=ICKDSF
//SYSPRINT DD SYSOUT=A
//DDNAME   DD UNIT=3380,DISP=OLD,VOL=SER=vvvvvv
//SYSIN    DD *
 CONTROL ALLOWWRITE DDNAME(DDNAME)
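Because the text recommends running ICKDSF against several devices, an installation may find it convenient to generate the skeleton for a list of volumes. The Python sketch below does this by repeating the DD statement and the CONTROL ALLOWWRITE statement once per volume. It is an assumption-laden illustration: the function name, the default job name, the step name and the VOLn ddnames are invented here, and placing several CONTROL statements in one step should be verified against the ICKDSF documentation for your release before use.

```python
def allowwrite_job(volsers, jobname="ALLOWWR", unit="3380"):
    """Build JCL text that runs CONTROL ALLOWWRITE against each listed volume.

    volsers -- list of volume serial numbers (hypothetical values in the example)
    """
    if not volsers:
        raise ValueError("at least one volume serial is required")

    lines = [
        f"//{jobname} JOB",
        "//STEP1 EXEC PGM=ICKDSF",
        "//SYSPRINT DD SYSOUT=A",
    ]
    ddnames = []
    for index, volser in enumerate(volsers, start=1):
        ddname = f"VOL{index}"  # illustrative ddname, one per volume
        ddnames.append(ddname)
        lines.append(f"//{ddname} DD UNIT={unit},DISP=OLD,VOL=SER={volser}")

    lines.append("//SYSIN DD *")
    # One control statement per DD statement defined above.
    lines.extend(f" CONTROL ALLOWWRITE DDNAME({ddname})" for ddname in ddnames)
    return "\n".join(lines)


if __name__ == "__main__":
    print(allowwrite_job(["PS3803", "CB3800"]))
```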


13.0 3990 Fencing


Service Information Messages (SIM) 


Depending on the type of error condition, the 3990 DASD control unit may prevent the failing component
from being used by the system. Preventing the use of the component is referred to as ‘fencing’.
Notification of the initial fencing is through the use of a ‘SIM ALERT’ message.


A SIM ALERT message (Figure 38) is displayed on the operator’s console to notify the operators that a 
3990 service information message (SIM) has been written to the error recording data set (ERDS). 


The 3990 sends the SIM to the host system that issues the next I/O operation (this may be a different host 
system than the one that was performing the I/O operation when the SIM occurred). 


When a SIM ALERT is displayed with a severity other than SERVICE, (Figure 39 and Figure 40) it is es- 
sential that you run an EREP exception report to get the additional information from the ERDS as quickly 
as possible. 


If a repair action is not completed, the 3990 re-issues the SIM format sense data eight hours after the first
SIM offload and then again eight hours after the second SIM offload to the host. After the last SIM offload
to the host, the SIM format sense data is marked on the SIM log and the data is offloaded to the host for
the final time whether a repair action is started or not.
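The re-issue timing described above amounts to three offloads spaced eight hours apart. The short sketch below computes the expected times from the time of the first offload; the function name and the sample date are illustrative assumptions.

```python
from datetime import datetime, timedelta


def sim_offload_times(first_offload, interval_hours=8, repeats=2):
    """Return the first SIM offload time followed by each re-issue time.

    The defaults reflect the text: the SIM is re-issued eight hours after the
    first offload and again eight hours after the second.
    """
    step = timedelta(hours=interval_hours)
    return [first_offload + i * step for i in range(repeats + 1)]


if __name__ == "__main__":
    for when in sim_offload_times(datetime(1989, 1, 10, 6, 0)):
        print(when.isoformat(sep=" "))
```

If no repair action has been completed by the time of the last entry, the operator should expect no further SIM ALERT for that fault.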


IEA480E 0cuu,xxxx,yyy ALERT, MT=mmmmmmm, SER=04aa-ddddddd, REFCODE=nnnn nnnn nnnn

MVS SIM ALERT FORMAT


DMKDAD403I 0cuu,xxxx,yyy ALERT, MT=mmmmmmm, SER=04aa-ddddddd, REFCODE=nnnn nnnn nnnn

VM/SP and VM/SP HPO SIM ALERT FORMAT


HCPERP403I 0cuu,xxxx,yyy ALERT, MT=mmmmmmm, SER=04aa-ddddddd, REFCODE=nnnn nnnn nnnn

VM/XA SIM ALERT FORMAT


Figure 38. SIM Alert Format Message Examples for MVS and VM Environments


Message Field           Description

0cuu                    Identifies the channel/unit address of the failing
                        storage control.

xxxx                    Identifies the failing component.
                        SCU specifies the fault occurred in the non-cache
                        portion of the storage control.
                        CACHE specifies the fault occurred in the cache or
                        nonvolatile storage portion of the storage control.

yyyyyyy                 Identifies the severity of the failure. The severity
                        can be ACUTE, SERIOUS, MODERATE, or SERVICE.

MT=mmmmmmm              Identifies the machine type and model number.

SER=04aa-ddddddd        Identifies the serial number of the failing unit.

REFCODE=nnnn nnnn nnnn  Identifies the reference code that the service
                        representative will need to repair the fault.

Figure 39. MVS and VM SIM Alert Fields


Failing     Severity   Meaning
Component

SCU         SERVICE    A service-related fault occurred that does not
                       affect storage path operation.

            MODERATE   A storage cluster temporary error threshold has been
                       exceeded, but both storage paths are operational.

            SERIOUS    A permanent error occurred on one storage path. One
                       storage path remains operational.

            ACUTE      A permanent error occurred on both storage paths in
                       this cluster.

CACHE       SERVICE    A cache or nonvolatile storage temporary error
                       threshold has been exceeded, but the storage
                       resource is operational.

            MODERATE   A permanent error occurred on 1 of 4 cache or
                       nonvolatile storage access paths.

            SERIOUS    A permanent error occurred on 2 of 4 cache or
                       nonvolatile storage access paths.

            ACUTE      A permanent error disabled cache or nonvolatile
                       storage.

Figure 40. Meaning of the SIM Alert Field by Failing Component


Types of Fencing 


The 3990 modifies and enhances the fencing of the 3880 as follows:

• Fence channel

  When either a non-resettable error or an error-threshold-exceeded condition occurs, the channel is
  fenced from that storage path. If the error occurs on both paths, the channel is fenced from the entire
  storage cluster (in both DLS and DLSE modes).

• Fence storage path

  This fence is similar to the fence storage director operation of the 3880, and replaces it. However, note
  the following differences:

  - In DLSE mode, there are two paths in a storage director (cluster), so it is much less likely that a
    fault will cause an entire storage director (cluster) to be fenced.

  - Some errors that do not cause a 3880 or a 3990 storage path in DLS mode to fence cause a 3990
    storage path in DLSE mode to fence, if a threshold of these errors is exceeded and the other
    storage path in the storage director (cluster) is operational.

  - On a 3990 Model 3, some cache errors cause a storage path to be fenced so that the rest of the
    storage paths in the subsystem can continue to use the cache.

• Fence the device from a storage path in DLSE mode

  A device is fenced from a storage path if there is an alternate path to a device within the storage
  director and a threshold of errors that appear to be path related is exceeded on that device.

DEVSERV PATH Message 


The DEVSERV PATH command is used to display the channel and storage path status for a device. Be- 
cause the DEVSERV PATH is already a lengthy console display, the fence display line will be included only 
when a fence is detected. 


Message IEE459I below includes the indications shown when one or more channel paths, storage paths, 
or devices have been fenced. 


A complete description of the message can be found in MVS/ESA System Messages, Vol 2.


IEE459I 22.16.30 DEVSERV PATHS 023
UNIT DTYPE M CNT VOLSER CHPID=PATH STATUS
120,3380J,O,000,PS3803, 05=+ 45=+
** FENCE yyyyyyyy xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
121,3380J,O,000,CB3800, 05=+ 45=+
** FENCE yyyyyyyy xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
122,3380J,O,000,CF3811, 05=+ 45=+
** FENCE yyyyyyyy xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
123,3380J,A,001,PRV004, 05=+ 45=+
** FENCE yyyyyyyy xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
************************ SYMBOL DEFINITIONS ************************
A = ALLOCATED                       + = PATH AVAILABLE
O = ONLINE

Where yyyyyyyy = STORAGE PATH or CHANNEL or DEVICE
Where xxxxxxxx... = 4 bytes of the path status for each
                    storage path (0, 1, 2 and 3)

Bytes 0-3: STATUS OF STORAGE PATH 0

  BYTE  DEFINITION
  0     STORAGE PATH STATUS
        BIT  DEFINITION
        0    1 = BYTES 1-3 contain valid values; 0 = SP not installed
        1    Device attaches through this storage path
        2    1 = BYTES 1-3 invalid; 0 = SP disabled
        3    Device permanently fenced from this SP (4-way)
        4    Command received on this path
        5-7  ID of channel requesting status; 0=CH A, 1=CH B, etc
  1     BIT map of channels configured in this cluster
  2     BIT map of channels enable/disable switches
  3     BIT map of channels fenced from this storage path

Bytes 4-7: STATUS OF STORAGE PATH 1
  The byte definitions are the same as for bytes 0-3 for storage path 0
  but apply to storage path 1.

Bytes 8-11: STATUS OF STORAGE PATH 2
  The byte definitions are the same as for bytes 0-3 for storage path 0
  but apply to storage path 2.

Bytes 12-15: STATUS OF STORAGE PATH 3
  The byte definitions are the same as for bytes 0-3 for storage path 0
  but apply to storage path 3.


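To show how the four status bytes of one storage path might be read, the sketch below decodes a single eight-digit hexadecimal word from the FENCE line according to the byte and bit definitions above. Several points are assumptions rather than statements from the message description: bits are numbered from the left (bit 0 is the high-order bit), the bit maps are returned simply as lists of set bit positions without naming channels, bit 2 is passed through undecoded, and the sample value is invented for illustration.

```python
def set_bits(byte):
    """Return the positions of the 1 bits, counting bit 0 as the leftmost bit."""
    return [pos for pos in range(8) if byte & (0x80 >> pos)]


def decode_storage_path(word):
    """Decode one 4-byte storage path status word given as 8 hex digits."""
    raw = bytes.fromhex(word)
    if len(raw) != 4:
        raise ValueError("expected 8 hexadecimal digits")
    status = raw[0]
    return {
        "valid": bool(status & 0x80),             # byte 0, bit 0
        "device_attached": bool(status & 0x40),   # byte 0, bit 1
        "bit2": bool(status & 0x20),              # byte 0, bit 2 (left undecoded)
        "device_fenced": bool(status & 0x10),     # byte 0, bit 3
        "command_received": bool(status & 0x08),  # byte 0, bit 4
        "requesting_channel": status & 0x07,      # byte 0, bits 5-7 (0 = CH A)
        "channels_configured": set_bits(raw[1]),  # byte 1 bit map
        "channel_switches": set_bits(raw[2]),     # byte 2 bit map
        "channels_fenced": set_bits(raw[3]),      # byte 3 bit map
    }


if __name__ == "__main__":
    # Hypothetical status word, not taken from a real display.
    for field, value in decode_storage_path("D9C0C040").items():
        print(f"{field}: {value}")
```

Applying the function to each of the four words on a FENCE line gives the status of storage paths 0 through 3 in turn.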
Unfencing 


We suggest that you have the service representative review and repair the hardware error first before 
removing the fence. 


Types of unfencing for 3990s in DLSE mode include:

• Channel fence

  Do one of the following:

  - Use ICKDSF to unfence the channel.

  - Press the Restart switch. This should be done only by authorized personnel, following procedures
    approved by the installation.

  - For certain conditions, the service representative may determine that it is appropriate to power
    the cluster off, then on again to remove the fence.

• Storage path fence

  Do one of the following:

  - Use ICKDSF to unfence the storage path.

  - Press the Restart switch. This should be done only by authorized personnel, following procedures
    approved by the installation.

  - For certain conditions, the service representative may determine that it is appropriate to power
    the cluster off, then on again to remove the fence.

• Device fence

  Do one of the following:

  - Use ICKDSF to unfence the storage path.

  - Press the Restart switch. This should be done only by authorized personnel, following procedures
    approved by the installation.

  - For certain conditions, the service representative may determine that it is appropriate to power
    the cluster off, then on again to remove the fence.


Types of unfencing for 3990s in DLSE/DLS mode include:

• MVS Path Vary

  Use the MVS VARY PATH command.

• DASD controller fence

  The 3990 also attempts to unfence the controller if one of the actions to unfence the storage path is
  taken. The controller repair action unfences the controller.

• Write inhibit

  Use ICKDSF to reset the Write Inhibit. Refer to “Write Inhibit” on page 87.

  The 3990 also resets the Write Inhibit if one of the actions to unfence a storage path is taken.

  Write Inhibit channel is reset if a system reset is received on the channel.


Note: Each installation should document the operational procedures for using the Restart switch. We
suggest you review and modify as appropriate your procedures for unfencing and using the Restart switch.


ICKDSF CLEARFENCE Command


Maintenance is required on the failing storage control, the failing device, or both. After the failing unit has
been repaired, use the ICKDSF CONTROL command with the CLEARFENCE parameter to clear the condition
for the path. This action clears the fence condition for all paths to all devices on the subsystem. The
specified device can be any device on the subsystem.


The following job skeleton shows the use of the CLEARFENCE command: 


//jobname  JOB
//stepname EXEC PGM=ICKDSF
//SYSPRINT DD SYSOUT=*
//DDNAME   DD UNIT=3380,DISP=OLD,VOL=SER=vvvvvv
//SYSIN    DD *
 CONTROL CLEARFENCE DDNAME(DDNAME)


14.0 Unconditional Reserve


Unconditional Reserve is a channel command used during some MVS/ESA recovery situations to break a
hung allegiance between a device and a channel path. It is used for DASD devices and control units that
do not permit use of the ‘Reset Allegiance’ command, or in cases where the Reset Allegiance command
has not managed to break the allegiance. For example, if a storage director fails while data transfer is in
progress between a device and the channel through that storage director or storage path, then an
allegiance is maintained that prevents the device from being accessed over any other channel path. That
is, the device appears ‘busy’ to any attempt to access it.


MVS/ESA initiates unconditional reserve processing without operator intervention providing a data integ- 
rity exposure does not exist. When a malfunction is detected during an I/O operation to a DASD device, 
such as an interface control check, and this device is reserved by the detecting system, the system proc- 
esses the Unconditional Reserve. The operator is notified of the result by one of the following messages: 


IOS428I - when recovery is successful
IOS429I - when recovery is unsuccessful


When MVS/ESA detects a hung allegiance and is unable to determine whether there is an integrity
exposure, it issues message IOS427A during the recovery process.


MVS/ESA issues this message only when all of the following conditions exist:

• An IFCC has occurred during an I/O operation to a DASD device that may have resulted in the device
  now maintaining allegiance (that is, the device may not be accessible to the detecting or sharing
  systems); and

• The DASD device in error recovery has been specified as ‘shared’; and

• The DASD device in error recovery is not reserved on the detecting system; and

• MVS is unable to determine the reserve status of the device on other systems. Under normal
  circumstances, MVS/ESA retrieves information about the device (including whether it is reserved) from
  the DPS arrays in the DASD/control unit.


If the above conditions exist, MVS/ESA requests the recovery option from the operator through message
IOS427A before issuing the Unconditional Reserve, so as to prevent breaking an allegiance that may be
genuinely held by another system.
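The four conditions can be read as a single logical AND. The sketch below expresses them as a predicate over a small record; it is a reading aid only, and the class name, field names and function name are invented for this illustration rather than taken from MVS/ESA.

```python
from dataclasses import dataclass


@dataclass
class DeviceRecoveryState:
    """Facts about a DASD device in error recovery (illustrative field names)."""
    ifcc_may_have_left_allegiance: bool  # an IFCC occurred during I/O to the device
    specified_as_shared: bool            # the device is defined as shared
    reserved_on_detecting_system: bool   # the detecting system holds the reserve
    reserve_status_known_elsewhere: bool # reserve status on other systems is known


def operator_prompt_expected(state):
    """Return True when all four conditions for message IOS427A are present."""
    return (
        state.ifcc_may_have_left_allegiance
        and state.specified_as_shared
        and not state.reserved_on_detecting_system
        and not state.reserve_status_known_elsewhere
    )


if __name__ == "__main__":
    state = DeviceRecoveryState(True, True, False, False)
    print(operator_prompt_expected(state))
```

If any one of the four conditions is absent, the predicate is false and the operator should not expect to be prompted with IOS427A for that device.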


The messages discussed in the following sections may be issued by MVS/ESA.


Message IOS427A


IOS427A ddd,cc, component FAILURE. REPLY WITH UR, BOX OR NOOP.


Where:
ddd = Device number
cc = Channel path id
component = One of the following:

            Channel Path
            Control Unit


The possible responses to message IOS427A are:

• UR

  This reply should be the preferred option if the device is not in use by sharing systems, or if data
  integrity is not important.

  Unconditional Reserve recovery action is attempted and therefore a device that has maintained
  allegiance to the broken path may have its allegiance to the broken path reset.

  In a shared environment, this reply can steal the reserve condition from another system if the correct
  recovery procedure is not followed prior to replying with UR, and therefore, data integrity may be
  impaired.

• BOX

  All the I/O requests to the device are posted complete with an error, which may result in job
  termination. The net result is that the device may appear busy to the other systems - that is, as
  message IOS071I START PENDING.

• NOOP

  The Unconditional Reserve recovery code is not executed, and therefore the response may result in
  not clearing an allegiance condition. The net result is that the device may appear busy to further I/O
  operations:

  - To the other paths on this system (issuing IOS427A)
  - To the other systems - that is, as message IOS071I START PENDING.

  If the same error persists, the message is presented again. If other types of errors are generated by
  the failing component, the path or device should be taken offline.


Procedure when Replying ‘UR’

1. Quiesce all sharing systems and system images in the case of a complex operating in LPAR mode.

   Issue the following MVS command:

   QUIESCE

   Wait for the quiesced system to enter a wait state ‘CCC’.

2. Reply UR to message IOS427A on the detecting system.

   Wait for message IOS428I or IOS429I to be issued.

3. Restart all sharing systems.


Message IOS428I


IOS428I ddd,cc, HAS BEEN RECOVERED THROUGH CHANNEL PATH zz


Where:
ddd = Device number
cc = Channel path id
zz = Channel path id


This message is issued when Unconditional Reserve processing has successfully recovered the device
through channel path ‘zz’. Unconditional Reserve processing was initiated either as a result of the operator
responding ‘UR’ to message IOS427A, or automatically when the system detected a hardware condition
associated with device ‘ddd’ and the device was reserved to the detecting system.


This message may indicate a hardware failure along channel path ‘cc’, and should be reported to the 
service representative. 


Message IOS429I


IOS429I ddd,cc, COULD NOT BE RECOVERED THROUGH AN ALTERNATE
CHANNEL PATH


Where:
ddd = Device number
cc = Channel path id


This message is issued when Unconditional Reserve processing was not able to recover device ‘ddd’
through an alternate channel path. Unconditional Reserve processing was initiated either as a result of the
operator responding ‘UR’ to message IOS427A, or automatically when the system detected a hardware
condition associated with device ‘ddd’ and the device was reserved to the detecting system.


The Unconditional Reserve processing was unsuccessful in recovering the device through an alternate 
channel path for one of the following reasons: 


1. No alternate channel paths were available for the device. 
2. All alternate channel paths were unsuccessful in recovery. 


3. The Unconditional Reserve command is not supported by the DASD hardware associated with the 
device. 


4. No Unconditional Reserve was requested. 


This message may indicate a hardware failure along channel path ‘cc’, and should be reported to the
service representative.


Page Data Set Volume Error 


If an I/O error condition requiring Unconditional Reserve recovery occurs on a device containing a page 
data set, IOS performs Unconditional Reserve processing through the use of DCCF: 


IOS427A ddd,cc, component FAILURE. REPLY WITH UR, BOX OR NOOP.


Where ‘ddd’ is the device number, and ‘cc’ is the channel path. 


Note: All MVS processing may be suspended until the operator replies to this message. 

The possible replies are:


UR = MVS uses the Unconditional Reserve. 


This option must be used with care since the Unconditional Reserve may steal the reserve ownership 
from sharing systems, but page data set volumes should normally be dedicated to a specific system; 
consequently, you can expect that no sharing system has a reserve on this device. 


Procedure when Replying ‘UR’


1. Quiesce all sharing systems, and all system images in the case of a complex operating in LPAR
mode.

Use the following MVS command:

QUIESCE

Wait for the quiesced system to enter wait state ‘CCC’.

2. Reply UR to message IOS427A on the detecting system.

Wait for message IOS428I or IOS429I to be issued.

3. Restart all sharing systems.


BOX = The device is boxed.


The device is boxed and the page data set is flagged bad. This means that all users with one page (or
more) on this device abend with a system code of 028. A ‘VARY ddd,ONLINE’ command does not
reactivate the use of the page data set. A new IPL may not solve the problem. Maintenance service
intervention may be required. This option is not recommended.


NOOP = The I/O operation is retried.


If the message recurs, check whether other devices on this path have similar problems (IOS427A or
IOS050I channel detected error). If so, remove the failing path using VARY PATH(xxx,yy),OFFLINE for
the device, or take the channel path offline using the command CF CHP(cc),OFFLINE.


If the message cannot be issued, or if it times out, it will be written to the system console.


If the problem recurs or a restartable wait state ‘06F’ is loaded, use procedure “PD06” on page 25 to
recover.

15.0 Hot I/O


Hot I/O occurs when a possible hardware malfunction presents unsolicited status from a device to the 
Channel Subsystem for enabled subchannels that are not ‘status pending’. 


The Channel Subsystem presents the unsolicited status to MVS. MVS keeps track of the number of
unsolicited status conditions presented, and when the number for the device exceeds the threshold value
(default of 100, or user-assigned), MVS recognizes a Hot I/O condition.
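
The threshold rule described above can be pictured with a small Python model. This is only an illustrative sketch, not MVS code: the class name `HotIoMonitor` and its method are hypothetical, and the only facts taken from the text are the default threshold of 100 and the rule that Hot I/O is recognized when the count for a device exceeds the threshold.

```python
from collections import Counter

DEFAULT_THRESHOLD = 100  # default named in the text; an installation may assign its own


class HotIoMonitor:
    """Hypothetical model: count unsolicited status per device and flag Hot I/O."""

    def __init__(self, threshold=DEFAULT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()  # device number -> unsolicited status count

    def record_unsolicited_status(self, device):
        """Count one unsolicited status; return True once the count exceeds the threshold."""
        self.counts[device] += 1
        return self.counts[device] > self.threshold
```

With a threshold of 3, the first three calls for a device return False and the fourth returns True, which mirrors "exceeds the threshold value" in the paragraph above.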


The Hot I/O recovery actions to be used may be tailored by the installation in SYS1.PARMLIB member 
IECIOSxx. 


It is possible to specify the following automatic recovery actions: 

BOX force the device offline 

CHP,K attempt channel path recovery 

CHP,F force the channel path offline 

OPER obtain recovery option from the operator through Hot I/O message 


The following recovery actions are available to the operator in response to the Hot I/O DCCF message
when the recovery option specified in PARMLIB member IECIOSxx is ‘OPER’.


• Box the device
• Fence the control unit - boxing all devices on the control unit
• Request channel path recovery
• Force the channel path offline


Recognizing Hot I/O


MVS notifies the operator of a Hot I/O by:


• Issuing one or more of the messages described in the following Hot I/O message sections. Through
the messages, the following information is communicated to the operator:

  - Which device, or range of devices, is causing the Hot I/O condition
  - Over which channel path the last interrupt occurred for the device
  - Whether any recovery action was requested
  - Whether any recovery action was attempted
  - The results of the recovery action

• Loading a restartable wait state.


If ‘OPER’ is specified, the recovery action for Hot I/O processing is obtained from the operator by issuing
one of the following Hot I/O messages:


• IOS110A Hot I/O on Non-DASD device
• IOS111A Hot I/O on Non-Reserved DASD or Non-Assigned DPS device
• IOS112A Hot I/O on Reserved DASD or Assigned DPS device


Recovering from Hot I/O 


This section describes how to respond to the various Hot I/O messages. 


Hot I/O Message - IOS109E 


Message IOS109E is issued to notify the operator that a Hot I/O condition has been detected, and that re- 
covery is automatically initiated. The recovery action invoked is determined by the IECIOSxx specification. 


The message has the following form: 


IOS109E HOT I/O RECOVERY option INITIATED FOR DEVICE ddd, 
CHPID cc 


‘option’ may be one of the following: 


BOX
CHP,F
CHP,K


A subsequent message (one of those listed below) indicates the results of the Hot I/O recovery processing. 


While no immediate operator response is required for message IOS109E, the operator should consider the 
impact of a recurring Hot I/O condition for that device on the system. If the Hot I/O condition is not cleared 
by the initial recovery action, the ‘recursive’ Hot I/O error recovery option is invoked. 


Message - IOS102I


IOS102I DEVICE ddd BOXED, OPERATOR REQUEST


Procedure


This message (IOS102I) is issued following IOS109E when the automatic recovery action initiated for the
detected Hot I/O condition is ‘BOX’. The operator should:


1. Consider fencing the entire control unit by forcing offline the range of attached devices if this condition 
occurs for more than one device on a control unit. 


2. Report the problem to the service representative. 


3. Refer to “Device Boxing” on page 65. After the hardware problem has been corrected, the boxed
device(s) may be brought online using the following command:


VARY ddd,ONLINE


if being returned from an offline boxed state, or


VARY ddd,ONLINE,UNCOND


if being returned from an online boxed state.

The preferred method is to first take the device completely offline and then bring it back online.


Message - IOS202I


IOS202I CHANNEL PATH cc FORCED OFFLINE


Procedure 


This message (IOS202I) is issued following IOS109E when the automatic recovery action initiated for the
detected Hot I/O condition is ‘CHP,F’, or when channel path recovery, initiated as a result of specifying
‘CHP,K’, is unsuccessful. The operator should:


1. Refer to “Channel Path Recovery” on page 113 for a description of handling channel path recovery. 
2. Report the problem, including the full text of the messages, to the service representative. 


3. After the hardware problem has been corrected, recover the channel path by issuing the following 
command: 


CF CHP(cc),ONLINE


Message - IOS203I


IOS203I CHANNEL PATH cc SUCCESSFULLY RECOVERED - DEVICE IS: ddd | UNKNOWN


Procedure


This message (IOS203I) is issued following IOS109E when the automatic recovery action initiated for the
detected Hot I/O condition is ‘CHP,K’ and the channel path recovery is successful.


If the Hot I/O condition is not cleared by the channel path recovery processing, the recursive Hot I/O error 
recovery option is invoked: 


1. If the Hot I/O device is a console on the same control unit as the master console, box the device by 
using the following command: 


V ddd,OFFLINE,FORCE 


2. If this condition occurs for more than one device on a control unit, the operator should consider fenc- 
ing the entire control unit by forcing offline the range of attached devices. 


3. Report the problem to the service representative. 


Hot I/O Message - IOS110A


The message shown in Figure 41 is issued using DCCF. 
This message is issued when a Hot I/O condition has occurred on a non-DASD, non-DPS device and either: 


• The installation has indicated through PARMLIB that the operator should specify the recovery action
for the device,

or

• The Hot I/O condition has persisted, despite previous attempts at recovery.


Note that if the response to the message is not received within 125 seconds, a DCCF timeout occurs, and 
the message is written to the system console. (The Processor Controller alarm sounds to alert the oper- 
ator.) The operator can then reply to the Hot I/O message from the SCPMSG frame of the system console. 


Procedure 


The operator should proceed as follows: 

1. Refer to the installation’s recovery procedures for the device indicated in the IOS110A message text.


The appropriate recovery action for a particular device may vary from installation to installation de- 
pending on the configuration and use of the device. 


2. Attempt to correct the problem by specifying the least-impacting recovery actions first. 


Reply with one of the following: 


NONE    requests IOS to do no recovery
DEV     requests IOS to box the device
CU      requests IOS to box all devices in the range specified
CHP,K   requests IOS to attempt channel path recovery
CHP,F   forces the channel path offline

3. After the response has been received by MVS, MVS issues the message:

PRESS CANCEL KEY TO RESTORE DISPLAY


The PA2 key is the Cancel key on 3270 consoles. 


IEE127I THE FOLLOWING MESSAGE IS ISSUED THROUGH DISABLED CONSOLE FACILITY
IOS110A IOS HAS DETECTED HOT I/O ON DEVICE ddd (NON-DASD). THE LAST
        INTERRUPT FROM THIS DEVICE WAS ON CHANNEL PATH xx. THE SCD
        IS AT aaaaaaaa. THERE ARE nn DEVICES WITH HOT I/O ON CHP xx.

        ENTER ONE OF THESE REPLIES TO TELL IOS HOW RECOVERY IS TO
        BE HANDLED:

NONE    THIS REPLY TELLS IOS THAT (1) THE OPERATOR DID NOT PHYSICALLY
        REMOVE ANY DEVICE OR CONTROL UNIT (HE MAY OR MAY NOT HAVE
        RESET THE DEVICE) AND (2) IOS SHOULD NOT REMOVE ANY DEVICE
        AND NOT ATTEMPT ANY CHANNEL RECOVERY.

DEV     THIS REPLY TELLS IOS TO LOGICALLY REMOVE (BOX) THE DEVICE.
        (THE OPERATOR MAY OR MAY NOT HAVE PHYSICALLY REMOVED THE DEVICE)

CU      THIS REPLY TELLS IOS THAT THE OPERATOR PHYSICALLY REMOVED THE
        CONTROL UNIT. THE REPLY MUST INCLUDE THE NUMBER OF EACH DEVICE
        ON THE CONTROL UNIT. FOR EXAMPLE, IF DEVICES 25E, 250 THRU 257
        REPLY: CU,250:257,25E
        OR
        CU,25E,250:257

CHP,K   THIS REPLY TELLS IOS (1) TO ATTEMPT RECOVERY FOR THE CHANNEL
        PATH NAMED IN THE MESSAGE, AND (2) IF RECOVERY IS SUCCESSFUL,
        TO KEEP THE CHANNEL PATH ONLINE.

CHP,F   THIS REPLY TELLS IOS TO FORCE THE CHANNEL PATH OFFLINE.

R 0,


Figure 41. Hot I/O Message IOS110A
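
The CU reply format shown in Figure 41 (single device numbers and `first:last` ranges separated by commas) can be checked before it is typed. The Python sketch below expands such a reply into individual device numbers. The function name `expand_cu_reply` is hypothetical, and treating ranges as inclusive hexadecimal ranges is an assumption drawn from the "250 THRU 257" example in the message text.

```python
def expand_cu_reply(reply):
    """Expand a reply such as 'CU,250:257,25E' into a list of device numbers.

    Assumption: device numbers are hexadecimal and ranges are inclusive,
    as the example in the IOS110A message text suggests.
    """
    keyword, *items = [part.strip() for part in reply.split(",")]
    if keyword.upper() != "CU" or not items:
        raise ValueError("expected 'CU,' followed by device numbers or ranges")
    devices = []
    for item in items:
        first, _, last = item.partition(":")
        low = int(first, 16)
        high = int(last, 16) if last else low
        if high < low:
            raise ValueError("descending range: " + item)
        devices.extend(format(number, "03X") for number in range(low, high + 1))
    return devices
```

Both example replies from the figure, `CU,250:257,25E` and `CU,25E,250:257`, expand to the same nine devices, only in a different order.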

Hot I/O Message - IOS111A


The message shown in Figure 42 is issued using DCCF. 


This message is issued when a Hot I/O condition has occurred on a non-reserved, or non-assigned 
DASD/DPS device and either: 


• The installation has indicated through PARMLIB that the operator should specify the recovery action
for the device,

or

• The Hot I/O condition has persisted, despite previous attempts at recovery.


Note that if the response to the message is not received within 125 seconds, a DCCF timeout occurs, and 
the message is written to the system console. (The Processor Controller alarm sounds to alert the oper- 
ator.) The operator can then reply to the Hot I/O message from the SCPMSG frame of the system console. 


Procedure


The operator should proceed as follows:


1. Refer to the installation’s recovery procedures for the device indicated in the IOS111A message text.

The appropriate recovery action for a particular device may vary from installation to installation
depending on the configuration and use of the device.


2. Attempt to correct the problem by specifying the least-impacting recovery actions first.

Reply with one of the following:

NONE    requests IOS to do no recovery
DEV     requests IOS to box the device
CHP,K   requests IOS to attempt channel path recovery
CHP,F   forces the channel path offline

3. After the response has been received by MVS, MVS issues the message:

PRESS CANCEL KEY TO RESTORE DISPLAY


The PA2 key is the Cancel key on 3270 consoles. 

IEE127I THE FOLLOWING MESSAGE IS ISSUED THROUGH DISABLED CONSOLE FACILITY
IOS111A IOS HAS DETECTED HOT I/O ON (DASD | ASSIGNABLE) DEVICE ddd. THE
        LAST INTERRUPT FROM THIS DEVICE WAS ON CHANNEL PATH xx. THE SCD
        IS AT aaaaaaaa. THERE ARE nnn DEVICES WITH HOT I/O ON CHP xx.

        ENTER ONE OF THESE REPLIES TO TELL IOS HOW RECOVERY IS TO
        BE HANDLED:

NONE    THIS REPLY TELLS IOS THAT (1) THE OPERATOR DID NOT
        PHYSICALLY REMOVE ANY DEVICE OR CONTROL UNIT (HE
        MAY OR MAY NOT HAVE RESET THE DEVICE) AND (2) IOS
        SHOULD NOT REMOVE ANY DEVICE AND NOT ATTEMPT ANY
        CHANNEL RECOVERY.

DEV     THIS REPLY TELLS IOS TO LOGICALLY REMOVE (BOX) THE
        DEVICE. (THE OPERATOR MAY OR MAY NOT HAVE PHYSICALLY
        REMOVED THE DEVICE.)

CHP,K   THIS REPLY TELLS IOS (1) TO ATTEMPT RECOVERY FOR THE
        CHANNEL PATH NAMED IN THE MESSAGE, AND (2) IF RECOVERY
        IS SUCCESSFUL, TO KEEP THE CHANNEL PATH ONLINE.

CHP,F   THIS REPLY TELLS IOS TO FORCE THE CHANNEL PATH OFFLINE.

R 0,


Figure 42. Hot I/O Message IOS111A


Hot I/O Message - IOS112A


The message (IOS112A) shown in Figure 43 is issued using DCCF.


This message is issued when a Hot I/O condition has occurred on a reserved DASD device or an assigned 
DPS device and either: 


• The installation has indicated through PARMLIB that the operator should specify the recovery action
for the device,

or

• The Hot I/O condition has persisted, despite previous attempts at recovery.


Note that if the response to the message is not received within 125 seconds, a DCCF timeout occurs, and 
the message is written to the system console. (The Processor Controller alarm sounds to alert the oper- 
ator.) The operator can then reply to the Hot I/O message from the SCPMSG frame of the system console. 


If you receive this message, the system is not able to recover from a hot I/O on a reserved shared device. 


This is potentially a serious problem. Data integrity can be impaired if you choose to box the reserved 
device. Check to see what the situation of this device is on this and on the sharing systems. If the device 
contains critical data, vary it offline from the sharing systems before replying ‘BOX’. 

Procedure


The operator should proceed as follows:


1. Refer to the installation’s recovery procedures.


2. Attempt to correct the problem by specifying the least-impacting recovery actions first.

Reply with one of the following:

NONE    requests IOS to do no recovery
DEV     requests IOS to box the device
CHP,K   requests IOS to attempt channel path recovery
CHP,F   forces the channel path offline

MVS will issue the message:

PRESS CANCEL KEY TO RESTORE DISPLAY


The PA2 key is the Cancel key on 3270 consoles.


IEE127I THE FOLLOWING MESSAGE IS ISSUED THROUGH DISABLED CONSOLE FACILITY
IOS112A IOS HAS DETECTED HOT I/O ON (RESERVED | ASSIGNED) DEVICE ddd. THE
        LAST INTERRUPT FROM THIS DEVICE WAS ON CHANNEL PATH xx. THE SCD
        IS AT aaaaaaaa. THERE ARE nnn DEVICES WITH HOT I/O ON CHP xx.

        ENTER ONE OF THESE REPLIES TO TELL IOS HOW RECOVERY IS TO
        BE HANDLED:

NONE    THIS REPLY TELLS IOS THAT (1) THE OPERATOR DID NOT
        PHYSICALLY REMOVE ANY DEVICE OR CONTROL UNIT (HE
        MAY OR MAY NOT HAVE RESET THE DEVICE) AND (2) IOS
        SHOULD NOT REMOVE ANY DEVICE AND NOT ATTEMPT ANY
        CHANNEL RECOVERY.

DEV     THIS REPLY TELLS IOS TO LOGICALLY REMOVE (BOX) THE
        DEVICE. (THE OPERATOR MAY OR MAY NOT HAVE PHYSICALLY
        REMOVED THE DEVICE.)

CHP,K   THIS REPLY TELLS IOS (1) TO ATTEMPT RECOVERY FOR THE
        CHANNEL PATH NAMED IN THE MESSAGE, AND (2) IF RECOVERY
        IS SUCCESSFUL, TO KEEP THE CHANNEL PATH ONLINE.

CHP,F   THIS REPLY TELLS IOS TO FORCE THE CHANNEL PATH OFFLINE.

R 0,


Figure 43. Hot I/O Message IOS112A


Handling Hot I/O Wait States 


When a Hot I/O condition is detected, MVS cannot proceed with recovery until a recovery action is
specified. As mentioned above, MVS obtains the recovery action from PARMLIB, if one has been specified. If
‘OPER’ is specified, MVS requests the recovery action from the operator through the message. If the
operator is not able to respond to the message, either because the response is locked out, or the time limit
of 125 seconds expired, the message is written to the system console, the alarm is sounded, and the
operator can respond there. Refer to “Disabled Console Communication Facility” on page 59 for a
description of DCCF.


If communication with both the MVS console and the system console fails, a restartable wait state is
loaded on the CP where the Hot I/O condition was detected. Restarting from this wait state is then the only
way for MVS to obtain the recovery action from the operator.


Depending on the category of device with the Hot I/O condition, one of the following restartable disabled
wait states is loaded (see “Disabled Wait States” on page 7):


WAIT 110 (corresponding to Hot I/O message IOS110A)
WAIT 111 (corresponding to Hot I/O message IOS111A)
WAIT 112 (corresponding to Hot I/O message IOS112A)


Note: The other processors may load a disabled WAIT A22. They are restarted automatically when the
CP in the Hot I/O wait state is restarted.


Use the following procedure to locate the device number, to correct the error, and to restart the system if
a wait condition (WAIT 110, WAIT 111, WAIT 112) occurs.


Procedure


At the system console, do the following:


1. If operating in LPAR mode, enter ‘SETLP lpname’ where lpname is the name of the logical partition
that has entered the disabled wait state that was indicated in the system console priority message.


2. Display the ALTER/DISPLAY frame by issuing:

F ALTCP

from the system console, or by selecting 02 on the INDEX0 frame.


3. On the ALTCP frame (Figure 44), select the CP that detected the Hot I/O; that is, the CP in the WAIT11x
(as indicated in the system console priority message for the 11x wait state).


4. Enter:

A2 (Display)

An arrow will appear in front of A2.


5. Then enter:

B3 (Primary Virtual Storage)

An arrow will appear in front of B3.


6. Then enter:

40C at ‘Address(hex) =>’

and record the contents of location ‘40C’ (‘aaaaaaaa’), which is the address of the status collection
data (SCD) area.

CP: 3                                                07 NOV 88 15:42:40
ALTER / DISPLAY CP                                              (ALTCP)

A= FUNCTION                     ADDRESS   DATA
   1. ALTER                     40C       01B6F170
-> 2. DISPLAY

B= FACILITY
   1. REAL STORAGE KEY
   2. REAL STORAGE
-> 3. PRIMARY VIRTUAL STG
   4. SECONDARY VIRTUAL STG
   5. GENERAL REGISTERS
   6. CONTROL REGISTERS
   7. FLOATING POINT REGS
   8. PREFIX REGISTER
   9. PSW

ADDRESS(HEX) => 40C

MORE DATA ABOVE AND BELOW: PRESS BKWD OR FWD.                     (124)
COMMAND ==>
                                                       PSW3   OPERATING

(Data rows for locations 410 through 470 are omitted here.)

Figure 44. ALTCP Frame (Part 1 of 4)


7. Using that result, enter:

‘aaaaaaaa’ at ‘Address(hex) =>’ (Figure 45)

to display the contents of the status collection data (SCD) area.

The contents have the following meaning:

• Offset X’4-5’ = 2-byte device number (hex) of the Hot I/O device
• Offset X’6’ = 1-byte channel path (hex) of the CHPID over which the last Hot I/O interrupt
occurred.

CP: 3                                                     07 NOV 88 15:42:40
ALTER / DISPLAY CP                                                   (ALTCP)
A= FUNCTION                  ADDRESS  DATA
   1. ALTER                  1B6F170  E2C3C440 08E06C90 9D899A30 BADBD204
-> 2. DISPLAY                781C170  S C D    ddddcc
                             1B6F180  00030000 40000000 OO0D0006E 00000064
B= FACILITY                  781C180
   1. REAL STORAGE KEY       1B6F190  00000000 01B6C7C8 00000000 00000000
   2. REAL STORAGE           781C190
-> 3. PRIMARY VIRTUAL STG    1B6F1A0  00008000 00000000 00000000 00000000
   4. SECONDARY VIRTUAL STG  781C1A0
   5. GENERAL REGISTERS      1B6F1B0  00000000 00000000 00000000 00000000
   6. CONTROL REGISTERS      781C1B0
   7. FLOATING POINT REGS    1B6F1C0  00000000 00000000 00000000 00000000
   8. PREFIX REGISTER        781C1C0
   9. PSW                    1B6F1D0  00000000 00000000 00000000 00000000
                             781C1D0
                             1B6F1E0  00000000 00000000 00000000 00000000
                             781C1E0
ADDRESS(HEX) => 1B6F170
MORE DATA ABOVE AND BELOW: PRESS BKWD OR FWD.                          (124)
COMMAND ==>
3 ..W. W. 5 ..W.                                          PSW3 OPERATING
Figure 45. ALTCP Frame (Part 2 of 4)


8. Report the problem.


9. Then enter:

30E at ‘Address(hex) =>’ to display location X’30E’ in the PSA (Figure 46).


10. Stop all CPs (press ALT-STOP keys on the system console).


11. Then on the system console command line, enter:

A1 (Alter)

An arrow will appear in front of A1 (Figure 47).

B3 (Primary Virtual Storage)

An arrow will appear in front of B3.


12. Move the cursor to location X’30E’ and type one of the following recovery options:

‘01’ - Status is cleared, device remains online.
‘02’ - Box the device.
‘04’ - Try channel path recovery and, if unsuccessful, the channel path is taken offline.
‘05’ - Force the channel path offline.

Then press the ENTER key.

CP: 3                                                07 NOV 88 15:42:40
ALTER / DISPLAY CP                                              (ALTCP)

A= FUNCTION                     ADDRESS   DATA
-> 1. ALTER                     30E       0000
   2. DISPLAY                   30E       02

B= FACILITY
   1. REAL STORAGE KEY
   2. REAL STORAGE
-> 3. PRIMARY VIRTUAL STG
   4. SECONDARY VIRTUAL STG
   5. GENERAL REGISTERS
   6. CONTROL REGISTERS
   7. FLOATING POINT REGS
   8. PREFIX REGISTER
   9. PSW

ADDRESS(HEX) => 30E

MORE DATA ABOVE AND BELOW: PRESS BKWD OR FWD.                     (124)
COMMAND ==>
                                             PSW3 000A0000 80000110

(Data rows for locations 310 through 370 are omitted here.)

Figure 46. ALTCP Frame (Part 3 of 4)


13. Start all CPs (press START key on the system console).


14. Restart the CP in wait.

On the system console command line, type:

RESTART CPn


CP: 3                                                07 NOV 88 15:42:40
ALTER / DISPLAY CP                                              (ALTCP)

A= FUNCTION                     ADDRESS   DATA
-> 1. ALTER                     30E       0200
   2. DISPLAY

B= FACILITY
   1. REAL STORAGE KEY
   2. REAL STORAGE
-> 3. PRIMARY VIRTUAL STG
   4. SECONDARY VIRTUAL STG
   5. GENERAL REGISTERS
   6. CONTROL REGISTERS
   7. FLOATING POINT REGS
   8. PREFIX REGISTER
   9. PSW

ADDRESS(HEX) => 30E

MORE DATA ABOVE AND BELOW: PRESS BKWD OR FWD.                     (124)
COMMAND ==> RESTART CP3
                                             PSW3 000A0000 80000110

(Data rows for locations 310 through 370 are omitted here.)

Figure 47. ALTCP Frame (Part 4 of 4)

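
The two pieces of data handled in the wait-state procedure above, the SCD area read in step 7 and the recovery option typed at location X’30E’ in step 12, can be illustrated in Python. This is a sketch under stated assumptions: the function and table names are hypothetical, the ‘SCD ’ identifier at offset 0 is inferred from the character annotation in Figure 45, and the sample bytes assume the damaged second word in that figure reads 08E06C90. Only the offsets X’4-5’ and X’6’ and the option values 01, 02, 04 and 05 come from the text.

```python
# Recovery option bytes for location X'30E', as listed in step 12.
RECOVERY_OPTIONS = {
    0x01: "clear status, device remains online",
    0x02: "box the device",
    0x04: "try channel path recovery, path taken offline if unsuccessful",
    0x05: "force the channel path offline",
}


def decode_scd(area):
    """Pick the Hot I/O device number and CHPID out of the first bytes of the SCD."""
    if len(area) < 7:
        raise ValueError("need at least 7 bytes of the SCD area")
    return {
        "eyecatcher": area[0:4].decode("cp037"),  # EBCDIC text (assumed identifier)
        "device": area[4:6].hex().upper(),        # offset X'4-5'
        "chpid": format(area[6], "02X"),          # offset X'6'
    }


def describe_recovery_option(value):
    """Return the meaning of a two-digit hex option, or raise for an unlisted value."""
    code = int(value, 16)
    if code not in RECOVERY_OPTIONS:
        raise ValueError("not a listed recovery option: " + value)
    return RECOVERY_OPTIONS[code]
```

For the first two words of Figure 45 (taken as E2C3C440 08E06C90), `decode_scd` gives device 08E0 on CHPID 6C, which matches the ‘ddddcc’ annotation under the data.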


16.0 Channel Path Recovery 


Channel path recovery routines are invoked under the following circumstances: 


• During processing of Hot I/O recovery options ‘CHP,K’ (attempt channel path recovery) and ‘CHP,F’
(force channel path offline).

• In response to a ‘CF CHP(cc),OFFLINE,FORCE’ command.

• After the Channel Subsystem reports a hardware malfunction on the channel path.

• As part of the recovery for the notification of a ‘System Resetting Event’ being sent from a device or
control unit.


Channel path recovery processing involves the use of the Reset Channel Path (RCHP) instruction, which 
causes the reserve status of devices to be lost, under the following conditions: 


• When the channel path to be reset is the last path to a reserved DPS DASD device.
• When the channel path to be reset is a path to a reserved non-DPS DASD device.
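
The two conditions above can be written as a single predicate. The Python sketch below is illustrative only; the `Device` record and its field names are hypothetical, and `online_paths` is assumed to count the path about to be reset.

```python
from dataclasses import dataclass


@dataclass
class Device:
    reserved: bool      # device is reserved to this system
    dps: bool           # device supports dynamic path selection (DPS)
    online_paths: int   # paths to the device, including the one being reset (assumption)


def reset_loses_reserve(device):
    """Return True if resetting one channel path would drop the device's reserve."""
    if not device.reserved:
        return False                      # nothing to lose
    if device.dps:
        return device.online_paths <= 1   # only when the last path is reset
    return True                           # any path to a reserved non-DPS device
```

A reserved DPS device with two paths keeps its reserve when one path is reset, while a reserved non-DPS device loses it regardless of how many paths it has.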


In order to avoid ‘reserve stealing’ from other sharing systems while channel path recovery is in progress,
the following message is issued:


IOS062E ERROR ON CHANNEL PATHS - STOP I/O REQUESTS FROM SHARING SYSTEMS


When sharing systems have been stopped, the operator replies to message IOS062E to indicate channel
path recovery can proceed on the detecting system.


When the RCHP instruction is completed, channel path recovery processing attempts to re-establish re- 
serves that were lost, and re-establishes the DPS arrays for DPS devices on the reset channel path. 


When channel path recovery processing is complete, the operator is notified by the following message: 


IOS201E START PROCESSORS STOPPED FOR MESSAGE IOS062E 


and the sharing systems can be restarted. 


Handling Message IOS062E 


Procedure 


Note that in LPAR mode, the START/STOP/RESTART SLCs and keys operate on the logical processors of
the selected partition.


When message IOS062E is issued on the detecting system, proceed as follows:


1. Enter ‘STOP’ at the system console of each sharing system, or logical partition, in the case of LPAR
mode.


2. Reply ‘U’ to message IOS062E on the detecting system.


If the reply is not accepted because the console is locked out, check at the system console to determine
whether a wait state ‘062’ has been loaded. If so, proceed to section “Handling Wait State 062”
on page 114; otherwise, proceed to the section “Handling Message IOS201E” on page 114.


Handling Wait State 062


If message IOS062E cannot be issued, or if it times out, a restartable wait state ‘062’ is loaded on the
detecting system.


Procedure 


Note that in LPAR mode, the START/STOP/RESTART SLCs and keys operate on the logical processors in 
the selected partition. 


1. Enter ‘STOP’ at the system console of each sharing system or logical partition in the case of LPAR 
mode. 


2. Enter ‘RESTART’ at the system console on the detecting system (the system in the WAIT 062).


Handling Message IOS201E


When channel path recovery is complete, a request to restart the stopped sharing processors is issued 
using one of the following messages: 


IOS201E START PROCESSORS STOPPED FOR MESSAGE IOS062E - RESERVES INTACT


IOS201E START PROCESSORS STOPPED FOR MESSAGE IOS062E - RESERVES LOST


Procedure 


Note that in LPAR mode, the START/STOP/RESTART SLCs and keys operate on the logical processors in 
the target partition. 


If the message indicates RESERVES INTACT, proceed as follows:


1. Start all stopped sharing systems by pressing the START key at the system console of each sharing 
system or logical partition, in the case of LPAR mode. 


2. Restart the detecting system by replying ‘U’ to message IEE125, which follows message IOS201E. 

If the message indicates RESERVES LOST, reserve(s) cannot be reestablished: 

1. Restart the detecting system by replying ‘U’ to message IEE125, which follows message IOS201E.
The device(s) are boxed on the detecting system.


The following messages may be issued indicating the state of devices:


IOS444I DYNAMIC PATHING NOT REMOVED FROM DEVICE ddd
IOS100I DEVICE ddd BOXED, LAST PATH cc LOST, CANNOT RE-RESERVE
IOS000I ddd,**SIM,**,**06,,,volser,jjjj

Jobs using the boxed devices may be abended.


2. Notify the System Programmer. 


The device(s) should not be made available to sharing systems until after data integrity has been 
checked. 


3. Start all stopped sharing systems by pressing the START key at the system console of each sharing
system or logical partition, in the case of LPAR mode.


The state of the channel path after recovery is indicated by one of the following messages:


IOS202I CHANNEL PATH cc FORCED OFFLINE - DEVICE IS: ddd | UNKNOWN
IOS203I CHANNEL PATH cc SUCCESSFULLY RECOVERED - DEVICE IS: ddd | UNKNOWN
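
The order of operator actions differs between the two forms of IOS201E, which is easy to get wrong under pressure. The Python sketch below restates the two procedures above as ordered step lists keyed on the message text. The function name is hypothetical and the step wording is a paraphrase of this section, not system output.

```python
def steps_after_ios201e(message):
    """Return the ordered operator steps for an IOS201E message (paraphrased)."""
    text = message.upper()
    if "RESERVES INTACT" in text:
        return [
            "Start all stopped sharing systems (START key at each system console)",
            "Reply U to message IEE125 to restart the detecting system",
        ]
    if "RESERVES LOST" in text:
        return [
            "Reply U to message IEE125 to restart the detecting system",
            "Notify the System Programmer; keep boxed devices from sharing systems "
            "until data integrity has been checked",
            "Start all stopped sharing systems (START key at each system console)",
        ]
    raise ValueError("message does not state RESERVES INTACT or RESERVES LOST")
```

The point of the sketch is the ordering: with reserves intact the sharing systems are started first, while with reserves lost they are started last, after the boxed devices have been dealt with.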


Handling Wait State 114 


If DCCF is not able to write message IOS201E, or if it times out, a restartable wait state ‘114’ is loaded on 
the detecting system. The PSW indicates: 


000A0000 00nn0114


where ‘nn’ contains one of the following values:

• ‘01’
Reserves are intact. The system has successfully recovered the reserved devices.

• ‘02’
Reserves are lost. The system has forced offline one or more devices reserved for the system.
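
The ‘nn’ byte can be extracted mechanically from the PSW shown on the system console. The Python sketch below is a minimal illustration that assumes the PSW is entered as two 8-digit hexadecimal words in the layout given above; the function name is hypothetical.

```python
def decode_wait_114(psw):
    """Decode the reserve status from a wait state 114 PSW such as '000A0000 00010114'."""
    words = psw.split()
    if len(words) != 2 or any(len(word) != 8 for word in words):
        raise ValueError("expected two 8-digit hex words")
    second = int(words[1], 16)
    if second & 0xFFF != 0x114:
        raise ValueError("PSW does not show wait state 114")
    nn = (second >> 16) & 0xFF  # the 'nn' byte in 00nn0114
    if nn == 0x01:
        return "RESERVES INTACT"
    if nn == 0x02:
        return "RESERVES LOST"
    raise ValueError("unlisted reserve status byte: %02X" % nn)
```

For example, a PSW of `000A0000 00020114` decodes to RESERVES LOST, so the ‘reserves are lost’ branch of the procedure below applies.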
Procedure 
If reserves are intact, proceed as follows: 


In LPAR mode: The START/STOP/RESTART SLCs and keys operate on the logical processors in the se- 
lected partition. 


1. Start all stopped sharing systems by pressing the START key at the system console of each sharing 
system or logical partition, in the case of LPAR mode. 


2. Restart the detecting system by entering ‘RESTART’ at the system console. 
If reserves are lost, proceed as follows: 
1. Restart the detecting system by replying ‘U’ to message IOS201E.

The device(s) are boxed on the detecting system. 


The following messages may be issued indicating the state of devices: 


IOS444I DYNAMIC PATHING NOT REMOVED FROM DEVICE ddd
IOS100I DEVICE ddd BOXED, LAST PATH cc LOST, CANNOT RE-RESERVE
IOS000I ddd,**SIM,**,**06,,,volser,jjjj


Jobs using the boxed devices may be abended. 


2. Notify the System Programmer. 


The device(s) should not be made available to sharing systems until after data integrity has been
checked.


3. Start all stopped sharing systems by pressing the START key at the system console of each sharing 
system or logical partition, in the case of LPAR mode. 


The state of the channel path after recovery is indicated by one of the following messages: 


IOS202I CHANNEL PATH cc FORCED OFFLINE - DEVICE IS: ddd | UNKNOWN
IOS203I CHANNEL PATH cc SUCCESSFULLY RECOVERED - DEVICE IS: ddd | UNKNOWN


Handling Message IOS113W and Wait State 113


If, during channel path recovery processing, the Reset Channel Path (RCHP) instruction has released some 
reserved devices, and device recovery (re-establishing reserves) is incomplete when an unrecoverable 
software error occurs in the recovery code, the following message is issued: 


IOS113W IOS RECOVERY FAILURE - RESERVES MAY HAVE BEEN LOST


IOS then loads a wait state 113. This wait state is not restartable and an IPL must be performed on this 
system. 


Procedure 
1. Notify the System Programmer. 


Reserved devices may have been released by channel path recovery and may have a data integrity 
problem, so data sets should be verified. 


2. Refer to installation procedures. If none exist, take a stand-alone dump and re-IPL the system in the 
WAIT113. 


Handling Message IOS004I


If an unrecoverable software error occurs during channel path recovery, but no reserved devices are in- 
volved, the recovery processing terminates with the following message: 


IOS004I IOS RECOVERY FAILURE - AVAILABILITY OF I/O DEVICES UNKNOWN


The system continues processing, but if critical devices are not available for use, system performance may 
be degraded. 


Procedure 
1. Notify the System Programmer. 


2. If other installation procedures fail, let the system complete as much work as possible, and schedule 
an IPL. 

17.0 I/O Hang Conditions


I/O hang conditions may occur as a result of a channel path error or any other type of error that makes the 
channel path unavailable to an I/O device. Two types of hang condition are considered: 


1. Storage Director hang - No device on any channel path through that storage director can be accessed. 


2. Device hang - A device appears to be busy and cannot be accessed by any sharing system. 


Recognizing I/O Hang Conditions
I/O hang conditions may be indicated by the following missing interrupt detection messages. 


IOS071I ddd,START PENDING
IOS071I ddd,MISSING CHANNEL AND DEVICE END
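

The two message forms above can be picked out of a console log mechanically. The following Python sketch assumes that 'ddd' is a three-digit hexadecimal device number and that the log is available as a list of text lines; the function name `hang_candidates` is hypothetical.

```python
import re

# Matches the two message forms quoted above. Treating 'ddd' as three
# hexadecimal digits is an assumption of this sketch.
IOS071I = re.compile(
    r"IOS071I\s+([0-9A-Fa-f]{3}),\s*(START PENDING|MISSING CHANNEL AND DEVICE END)"
)


def hang_candidates(log_lines):
    """Map each device number to the missing-interrupt conditions reported for it."""
    found = {}
    for line in log_lines:
        match = IOS071I.search(line)
        if match:
            found.setdefault(match.group(1).upper(), []).append(match.group(2))
    return found
```

A device that appears repeatedly in the result is a candidate for the hang checks described in this chapter; the sketch does not by itself establish that a hang exists.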


Refer to Figure 31 and Figure 32. If the problem cannot be resolved using these charts, use the
following procedure.


An I/O hang condition can also be identified by using:
• The Device Status Display frame, as follows:

1. At the system console, enter ‘F IOPD’.

2. Enter ‘A2’ to select device status.


3. Enter the CHPID in the proper field. The device or the range of devices in the hang condition can
be identified by a long-lasting Pending Status (P).


• 3880 Operator Panel status display


The Status Pending light of the Storage Director indicates unfinished work between a specific channel 
path and a specific device. The storage director does not allow selection unless the channel or device 
connection is requested. The storage director responds to any other selection attempt with a control 
unit busy condition. 


Unless it is busy, the storage director requests service to clear pending status. Status is cleared when
the signal is presented to, and accepted by, the channel.


You can determine which system is causing the problem by looking at the Process and Wait lights on 
the 3880 Operator panel: 


- The Wait and Process lights blink on and off once per second to indicate the hung channel. The
lights blink once for channel A, twice for channel B, and so on.


- The blinking sequence is followed by a 5-second pause and is then repeated until the status
pending hang condition is cleared.
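

The blink-count convention above (once for channel A, twice for channel B, and so on) is a simple position-to-letter mapping. The following Python sketch shows it; the function name `hung_channel` is hypothetical, and the upper bound of 26 is only a guard for the sketch, not a statement about how many channels a storage director has.

```python
def hung_channel(blink_count):
    """Translate the number of blinks in one sequence into a channel letter (1 -> A, 2 -> B, ...)."""
    if not 1 <= blink_count <= 26:
        # Guard for the sketch only; not a hardware limit.
        raise ValueError("blink count out of range for this sketch")
    return chr(ord("A") + blink_count - 1)
```

Count the blinks between two 5-second pauses, then pass that count to the function.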


Note: When the processor complex is operating in LPAR mode, the IOPD frames (except Channel Summary
Status) are partition sensitive. Refer to IBM 3090 Operator Controls for the System Console.


Handling I/O Hang Conditions


Procedure 


To handle I/O hang conditions proceed as follows: 


1. Try to vary the failing path offline:

VARY PATH(ddd-ddd,cc),OFFLINE
If the storage director is hung, the ‘VARY PATH’ command is likely to time out.
If the ‘VARY PATH’ command does not work, then try:

CF CHP(xx),OFFLINE,FORCE


If unsuccessful, first vary ALL paths to all sharing systems offline and then do a manual I/O system
reset for the channel path that caused the hang situation. This action can be tried to make the hung
control unit available to the sharing systems. It should be used only in emergency situations.


At the system console associated with the system that has the hung channel interface, issue the
following command:


IFRST cc 
where ‘cc’ is the CHPID to be reset. This command resets all storage directors attached to this CHPID. 


Note: Any reset can cause data integrity exposures if used on non-DPS devices, or on DPS devices 
without an active alternate path. 


If the hang is still not cleared, reset the system, either by using option ‘03’ (SYSTEM RESET) on the
OPRCTL frame at the system console, or by IPLing. Both actions reset all channels.


If the I/O system reset from the system console cannot be executed, use the following alternative: 
a. POWER-ON RESET the ES/3090 
b. POWER OFF/ON the ES/3090 


If a hung controller condition is resolved with an I/O System Reset, some DASD devices may have lost 
their reserved status and should not be made accessible to the sharing systems until data has been 
recovered. 


If the CU is still in hang condition, an IML of the CU must be performed. IML of an active 3880 Storage
Director without IPL of the attached host systems can lead to data integrity problems. (In this case,
an active 3880 SD means one with channel paths online.)
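

The procedure above escalates from the least disruptive action to the most disruptive. The following Python sketch restates that order as data; the names `ESCALATION` and `escalation_plan` are hypothetical, the command templates are copied from the text, and nothing in the sketch submits a command to a real console.

```python
# Escalation order restated from the procedure above. Entries with None have
# no single command string in the text (they are console or hardware actions).
ESCALATION = [
    ("Vary the failing path offline", "VARY PATH({devs},{chpid}),OFFLINE"),
    ("Force the channel path offline", "CF CHP({chpid}),OFFLINE,FORCE"),
    ("Vary all paths offline on all sharing systems, then do a manual "
     "I/O system reset (emergency only)", "IFRST {chpid}"),
    ("System reset (OPRCTL frame) or IPL", None),
    ("Power-on reset or power off/on, if the reset cannot be done from the console", None),
    ("IML of the control unit", None),
]


def escalation_plan(devs, chpid):
    """Return (description, command-or-None) pairs in the order the text gives them."""
    plan = []
    for description, template in ESCALATION:
        command = template.format(devs=devs, chpid=chpid) if template else None
        plan.append((description, command))
    return plan
```

The data integrity warnings in the procedure apply at every reset step and are not modeled here.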


18.0 GRS


Global Resource Serialization (GRS) is a component of every MVS/SP system. GRS allows DASD
‘resources’ such as data sets, catalogs, and so on, to be shared between the several systems that
constitute a GRS Complex.


Sharing is done without having to issue a hardware reserve for the total volume from the requesting
system, which only wants a small part of the volume (for example, a data set).


The systems in the GRS complex are connected by CTCs that form a GRS ring. 


GRS commands (‘Display GRS’ and ‘Vary GRS’) are documented in MVS/ESA Operations: System Commands.
GRS messages can be found in MVS/ESA Message Library: System Messages, Volume 1, and
MVS/ESA Message Library: System Messages, Volume 2.


This chapter provides guidelines for:
• Reconfiguring a GRS ring


• Recovering from GRS ring disruption


GRS Ring Reconfiguration 


It may be necessary to reconfigure a GRS ring for the following reasons: 


• During partitioning of ES/3090 MP models when the ‘IN-USE’ CTC is using a CHPID attached to the
‘off-going’ side.


• When you have to isolate the 3088 for servicing.


In either case, it is also necessary to consider the other users of the 3088, which may include VTAM (NET), 
JES2/NJE, or JES3. 


The following procedure is used to reconfigure the GRS ring: 


1. Use the ‘D GRS’ command to determine the CTC device numbers of the ‘IN-USE’ and ‘ALTERNATE’ 
link(s) between this system and other systems in the GRS complex. 


This procedure allows the ‘IN-USE’ link and the ‘ALTERNATE’ link(s) attached to:
• The off-going side (MP case), or
• An IBM 3088 (IBM CE maintenance case)
to be taken offline and another ‘ALTERNATE’ link to become the ‘IN-USE’ link.

2. Use the following commands, or customer documentation, to determine:


• For partitioning, the CHPIDs used by the CTCs on this GRS system:


D M=DEV(ddd) where ‘ddd’ is a GRS CTC device


• For maintenance, the address range of the IBM 3088 for which the CTC is just one of the
addresses in the range. Use customer-provided documentation.


3. Issue the following command against all ALTERNATE or QUIET CTCs connected to the off-going side
of the processor (MP case) or the 3088 (maintenance case):


V ddd,OFFLINE
where ‘ddd’ is the device number of the CTC.
This results in the CTC pending offline message:
IEE794I ddd PENDING OFFLINE


4. If the CTC to be taken offline is the ‘IN-USE’ CTC, then issue the following commands:
V GRS(sysname),QUIESCE

where ‘sysname’ is the name of the system to be quiesced.
V ddd,OFFLINE


where ‘ddd’ is the device number of the CTC that is connected to the off-going side.


This results in the CTC pending offline message:
IEE794I ddd PENDING OFFLINE


5. Issue the following:
S DEALLOC or account-dependent deallocation procedure


This results in the CTC being taken offline.
The following message is issued:

ISG047I CTC ddd DISABLED
Note: If the CTC does not go offline, check for ENQ contention on resource ‘SYSIEFSD’. The ‘VARY
OFFLINE’ command may be enqueued behind another task whose ‘Vary’ operation cannot complete
because of an outstanding global enqueue. The global enqueue cannot be resolved since GRS has
now been quiesced. It may be necessary to first restart GRS, resolve the global enqueue, and then
proceed with the command:


V GRS(sysname),QUIESCE


6. This item is an informational step.


Issue the command:


D GRS,LINK
to display the status of the GRS CTCs. The D GRS,LINK display should show that all the CTCs that
were previously labelled as ‘IN-USE’ or ‘ALTERNATE’ and are attached to CHPIDs on the off-going side
now have a status of ‘DISABLED’.


7. If the command:


V GRS(sysname),QUIESCE


was issued in a previous item, issue the following command on any system to resume GRS
communication with the target system:


V GRS(sysname),RESTART
where ‘sysname’ is the system to be restarted. GRS will resume communication of global requests
using a GRS CTC that is attached to a CHPID on the on-going side.


8. Issue the following command to determine whether there are other users on the same 3088 from this
system:
D U,,ALLOC,dd0,32|64
The resulting display will be in the form:


IEE106I 08.10.00 UNITS ALLOCATED 255


UNIT     JOBNAME  ASID
CC0      NET      00A
CC6      JES2     009


9. Use the relevant owning-subsystem commands to terminate allocation of the 3088 devices listed in the
display.


10. Once all allocations to the 3088 address range devices have been terminated, the MVS command:


VARY ddd-eee,OFFLINE


may be issued to take all the 3088 devices offline.


Normal MP reconfiguration procedures may now be performed. 
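

The allocation display from step 8 can be summarized by unit to see which subsystems still own 3088 devices before step 9. The following Python sketch assumes the display has exactly the three-column shape shown in the sample (UNIT, JOBNAME, ASID) and is available as a list of text lines; the function name `parse_units_allocated` is hypothetical, and a real display may contain rows this sketch does not handle.

```python
def parse_units_allocated(display_lines):
    """Return {unit: (jobname, asid)} from a 'D U,,ALLOC' display shaped like the sample in step 8."""
    allocations = {}
    in_body = False
    for line in display_lines:
        fields = line.split()
        if fields[:3] == ["UNIT", "JOBNAME", "ASID"]:
            in_body = True           # data rows follow the column header
            continue
        if in_body and len(fields) == 3:
            unit, jobname, asid = fields
            allocations[unit] = (jobname, asid)
    return allocations
```

An empty result after step 9 suggests that no allocations remain on this system, at which point the VARY command in step 10 can be issued.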


Warning: If an IBM 3088 is being removed for maintenance-related reasons, it must be taken offline
from all of its attached systems.


Note that when a GRS CTC device is later returned to use, as soon as it is brought online, it will be
allocated to GRS. The following message will be issued:


ISG047I CTC ddd ENABLED


Recovering from GRS Ring Disruption 


A GRS ring disruption may occur for one of the following reasons: 


• A failure in one system in the ring which prevents that system from responding to other systems over
the GRS CTC, including:

- An operational problem (for example, processor stopped).
- A customer configuration design problem (such as device contention on the same CHPID as the GRS
CTC link).
- A general software failure (for example, a disabled loop or Abend).
- A hardware failure (for example, loss of CHPID or CTC).


• A software failure causing an Abend in the GRS code.


Ring Disruption Detection 


GRS ring disruption may be indicated by the following messages: 


*ISG023E GLOBAL RESOURCE SERIALIZATION DISRUPTED
GLOBAL RESOURCE REQUESTORS WILL BE SUSPENDED


or
ISG046E CTC ddd DISABLED DUE TO HARDWARE|SOFTWARE ERROR CODE=rc
followed by:


ISG022E SYSTEM sysname DISRUPTED GLOBAL RESOURCE SERIALIZATION DUE TO
COMMUNICATION FAILURE - GLOBAL RESOURCE REQUESTORS WILL BE SUSPENDED


or
ISG021I fc-rc ERROR IN GLOBAL RESOURCE SERIALIZATION FUNCTION
followed by:


ISG022E SYSTEM sysname DISRUPTED GLOBAL RESOURCE SERIALIZATION DUE TO
SOFTWARE FAILURE - GLOBAL RESOURCE REQUESTORS WILL BE SUSPENDED

Ring Disruption Recovery 


GRS RESTART requires that more than one-half of the systems in the original ring be able to restart.
This applies to all rings of three or more systems.
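

The ‘more than one-half’ rule can be checked with simple arithmetic. The following Python sketch is a minimal illustration; the function name `restart_possible` is hypothetical, and because the text states the rule only for rings of three or more systems, the sketch refuses smaller rings rather than guessing.

```python
def restart_possible(systems_in_original_ring, systems_able_to_restart):
    """Apply the 'more than one-half' rule stated above for rings of three or more systems."""
    if systems_in_original_ring < 3:
        raise ValueError("the rule above is stated only for rings of three or more systems")
    if not 0 <= systems_able_to_restart <= systems_in_original_ring:
        raise ValueError("able-to-restart count must be between 0 and the ring size")
    # Strictly more than half: compare doubled count to avoid fractions.
    return 2 * systems_able_to_restart > systems_in_original_ring
```

For a four-system ring, for example, two restartable systems are not enough; three are needed.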


A disrupted GRS ring may be rebuilt using the command: 


VARY GRS(ALL),RESTART


A GRS ring may be rebuilt after a disruption in one of two ways:
1. Automatic restart


Initiated by GRS automatically after detection of ring disruption. This is the recommended method of
GRS RESTART. In this case, GRS internally issues the ‘VARY GRS(xxx),RESTART’ command.


2. Operator-initiated restart


Recommended only after automatic restart has failed to rebuild the ring. In this case, the operator
issues the ‘VARY GRS(sysname),RESTART’ command from just one of the systems to each of the
other systems.


Automatic Restart Considerations 


When ‘automatic restart’ capability is specified for a system in the GRSCNFxx member of PARMLIB, that 
system, if capable, will automatically initiate restart processing (that is, attempt to rebuild the ring) after 
a GRS ring disruption has been detected. 


In a GRS complex consisting of three or more systems, automatic restart may be specified for any system,
but it is recommended that restart always be coded for the most critical system.


When it is possible to initiate GRS RESTART on multiple systems in the ring, either through automatic
restart or operator-initiated restart, care must be taken that split rings (separate, independent GRS rings)
do not form after a disruption.


Warning: It is recommended that following a ring disruption, GRS should be permitted to restart
automatically; that is, without operator intervention. However, if GRS automatic restart fails to rebuild
the ring, operator-initiated restart is required. In that case, any system whose automatic restart
capability failed to respond should be ‘SYSTEM RESET’ before the operator enters the
‘VARY GRS(xxx),RESTART’ command.


The following messages are displayed to indicate that automatic restart has been initiated by GRS:
ISG024I SYSTEM sysname INITIATED AUTO RESTART PROCESSING
INTERNAL VARY GRS(ALL),RESTART CFA/CDD/077B8781/077B87A2/SCSDS/ISGBTC
6C000028/98156101/.


The following messages are issued to indicate that the GRS ring is being rebuilt for each system that was
in the ring at the time of the disruption:


ISG011I SYSTEM sysname - RESTARTING GLOBAL RESOURCE SERIALIZATION
ISG013I SYSTEM sysname - RESTARTED GLOBAL RESOURCE SERIALIZATION
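

Because ISG011I and ISG013I are issued per system, pairing them shows which systems have finished rejoining the ring. The following Python sketch tracks that; the function name `restart_progress` is hypothetical, and the pattern accepts either a hyphen or a long dash as the separator after the system name, since console listings differ.

```python
import re

# ISG011I marks the start of a restart for a system, ISG013I its completion.
RESTART_MSG = re.compile(
    r"(ISG011I|ISG013I)\s+SYSTEM\s+(\S+)\s+[-\u2014]\s+(RESTARTING|RESTARTED)\b"
)


def restart_progress(console_lines):
    """Return (restarted, still_restarting) sets of system names."""
    restarting, restarted = set(), set()
    for line in console_lines:
        match = RESTART_MSG.search(line)
        if not match:
            continue
        if match.group(1) == "ISG013I":
            restarted.add(match.group(2))
        else:
            restarting.add(match.group(2))
    # A system is still pending if it started but has not reported completion.
    return restarted, restarting - restarted
```

A system that stays in the second set for an extended period may need the operator-initiated procedures described later in this chapter.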


Note that the system programmer can specify the ‘REJOIN’ option for a system in the GRSCNFxx member
of SYS1.PARMLIB. REJOIN(YES) allows the system to automatically rejoin the ring when it resumes
processing, and no operator intervention is required. However, if the system programmer specified
REJOIN(NO), the operator must bring the system back into the ring when the system resumes processing.


When a GRS ring disruption is caused by the failure of a system that must then be re-IPLed, that system
must be purged from the GRS ring (by issuing the command ‘VARY GRS(sysname),PURGE’ from a system
active in the GRS ring) before the system is re-IPLed.


Account procedures must take into consideration that the failed system may have been holding exclusive 
use of resources at the time of the failure, and that purging the system from the GRS ring may now cause 
a data integrity exposure. Once the system holding the resource is purged from the GRS complex, the 
resource then becomes available for use by other systems. However, at the time of the system failure, the 
resource update may not have been completed. The operator, following installation procedures, must 
decide whether it is necessary to cancel any jobs waiting for use of the resource, thereby preventing the 
potential for a data integrity problem. 


Operator-initiated Restart Considerations 


Have the GRS configuration diagram available for review during the recovery procedures. 


When the operator enters the ‘VARY GRS(sysname),RESTART’ command, the following messages may be
issued:


ISG026I SYSTEM sysname MAY CREATE A SPLIT RING IF ANY OTHER GRS
SYSTEM IS ACTIVE, VERIFY THAT NO GRS SYSTEM IS ACTIVE
BEFORE CONFIRMING RESTART

ISG027D CONFIRM RESTART - RING FOR SYSTEM sysname - REPLY NO OR YES


It is necessary to purge the disrupting system from the GRS complex, but prior to purging a disrupting
system from the GRS complex, the operator should determine if there were any resources held exclusively
by the disrupting system that were also being ENQ requested by any of the active systems. Issue the
following command from any of the active GRS systems:


D GRS,C 


Observe whether the disrupting system ‘system name’ appears in the contention display. 


Account procedures must take into consideration that the failed system may have been holding exclusive 
use of resources at the time of the failure, and that purging the system from the GRS ring may now cause 
a data integrity exposure. Once the system holding the resource is purged from the GRS complex, the 
resource then becomes available for use by other systems. However, at the time of the system failure, the 
resource update may not have been completed. The operator, following installation procedures, must 
decide whether it is necessary to cancel any jobs waiting for use of the resource, thereby preventing the 
potential for a data integrity problem. 


Now that a data integrity exposure has been prevented, purge the disrupting system from the GRS complex 
by issuing the following command from an active system: 


V GRS(sysname),PURGE


Bring the disrupting system back into the GRS complex (IPL), after first having performed software failure 
recovery procedures, for example, by taking a SADUMP. 


Refer to the GRS manuals to build (or rebuild) the GRS complex.


19.0 Alerts


Configuration Alert - IOS163A


Warning: Serious and prolonged performance degradation can result if this problem is not correctly
resolved.


All occurrences of IOS163A messages must be addressed, and the causes correctly identified.


A configuration alert is a result of:


• An interrupt from a device for which there is no matching subchannel.
or


• The IOCDS contained at least two IODEVICE macroinstructions (device definitions) for a specified
device number, where each device number was assigned to a separate partition but they are currently
configured in the same partition.


MVS informs the Operator by message:


IOS163A CHPID nn ALERT, NO ASSOCIATED SUBCHANNEL FOR DEVICE


Procedure 


To handle a configuration alert, proceed as follows: 


1. Display the 3090 log at the system console by pressing the VIEW LOG key. Find the associated
message.


2. Go to step 3 if the message is:
INTERRUPT FROM DEVICE NOT IN IOCDS CHPID= nn UA=xx (25197)
Go to step 4 if the message is:
XXXXXXXX XXXX xxxx on CHPID nn DUPLICATE EXISTING NUMBERS (51574)


3. This condition may occur when:
• A device has been installed but not defined in the current IOCDS.
• A device has been cabled or defined incorrectly.


Whatever the cause of the mismatch, it represents a potentially serious performance situation. The
first interrupt received by the Channel Subsystem is reported to MVS through a machine check
interrupt. MVS/SP V2 and V3 then report this condition to the operator by message IOS163A.


However, any subsequent interrupts are handled only by the Channel Subsystem and are not reported to
MVS. Continuous interrupts use up the resources of the Channel Subsystem and channel. Review the
IOCDS and contact the hardware service representative.


4. Each device definition within a partition must have a unique device number. The IOCP reports the
duplicate device numbers and the devices they represent. The IOPD frame can be used to show the
device number and its corresponding channel paths for the specified partition. Contact the System
Programmer.

This condition is caused if the CHPID paths supporting the duplicate device numbers are both brought
into the same partition. Determine from the 3090 message 51574 which channel was brought into the
configuration. Consider configuring this CHPID offline. Report this condition to the System Programmer.
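

The branch in step 2 depends only on the message number in parentheses at the end of the 3090 log line. The following Python sketch shows that decision; the names `LOG_MESSAGE_STEP` and `next_step` are hypothetical, and the sketch assumes the number is the last item on the line, as in the two examples quoted in the procedure.

```python
import re

# Message numbers quoted in the procedure above, mapped to the step they lead to.
LOG_MESSAGE_STEP = {
    "25197": 3,   # INTERRUPT FROM DEVICE NOT IN IOCDS
    "51574": 4,   # DUPLICATE EXISTING NUMBERS
}


def next_step(log_line):
    """Return the procedure step for a 3090 log line, or None if the number is not one of the two."""
    match = re.search(r"\((\d+)\)\s*$", log_line)
    return LOG_MESSAGE_STEP.get(match.group(1)) if match else None
```

A result of None means the log line is not one of the two messages covered by this procedure.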


Malfunction Alert - IOS162A


A malfunction alert is the result of a hardware error on a device, control unit, or channel which prevents
the Channel Subsystem from recognizing the unit address of the device generating the interrupt. The
Channel Subsystem generates a CRW (Channel Report Word) to inform MVS. Then MVS notifies the
operator of the condition by issuing message:


IOS162A CHPID xx ALERT, UNSOLICITED MALFUNCTION ALERT


Procedure 
Report the problem to your hardware service representative. 


Whatever the cause of a malfunction alert, it represents a potentially serious performance degradation,
and should be addressed immediately.